-
Notifications
You must be signed in to change notification settings - Fork 48
Expand file tree
/
Copy pathmasking.pymd
More file actions
203 lines (144 loc) · 10.5 KB
/
masking.pymd
File metadata and controls
203 lines (144 loc) · 10.5 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
# Masking
```python setup, echo=False
import pyrasterframes
from pyrasterframes.utils import create_rf_spark_session
from pyrasterframes.rasterfunctions import *
import pyrasterframes.rf_ipython
from IPython.display import display
import pandas as pd
import numpy as np
from pyrasterframes.rf_types import Tile
spark = create_rf_spark_session()
```
Masking is a common operation in raster processing. It is setting certain cells to the @ref:[NoData value](nodata-handling.md). This is usually done to remove low-quality observations from the raster processing. Another related use case is to @ref:["clip"](masking.md#clipping) a raster to a given polygon.
In this section we will demonstrate two common schemes for masking. In Sentinel 2, there is a separate classification raster that defines low quality areas. In Landsat 8, several quality factors are measured and the indications are packed into a single integer, which we have to unpack.
## Masking Sentinel 2
Let's demonstrate masking with a pair of bands of Sentinel-2 data. The measurement bands we will use, blue and green, have no defined NoData. They share quality information from a separate file called the scene classification (SCL), which delineates areas of missing data and probable clouds. For more information on this, see the [Sentinel-2 algorithm overview](https://earth.esa.int/web/sentinel/technical-guides/sentinel-2-msi/level-2a/algorithm). Figure 3 tells us how to interpret the scene classification. For this example, we will exclude NoData, defective pixels, probable clouds, and cirrus clouds: values 0, 1, 8, 9, and 10.

Credit: [Sentinel-2 algorithm overview](https://earth.esa.int/web/sentinel/technical-guides/sentinel-2-msi/level-2a/algorithm)
The first step is to create a catalog with our band of interest and the SCL band. We read the data from the catalog, so all _tiles_ are aligned across rows.
```python, blue_scl_cat
from pyspark.sql import Row
blue_uri = 'https://rasterframes.s3.amazonaws.com/samples/luray_snp/B02.tif'
green_uri = 'https://rasterframes.s3.amazonaws.com/samples/luray_snp/B03.tif'
scl_uri = 'https://rasterframes.s3.amazonaws.com/samples/luray_snp/SCL.tif'
cat = spark.createDataFrame([Row(blue=blue_uri, green=green_uri, scl=scl_uri),])
unmasked = spark.read.raster(cat, catalog_col_names=['blue', 'green', 'scl'])
unmasked.printSchema()
```
```python, show_cell_types
unmasked.select(rf_cell_type('blue'), rf_cell_type('scl')).distinct()
```
### Define CellType for Masked Tile
Because there is not a NoData already defined for the blue band, we must choose one. If we try to apply a masking function to a tile whose cell type has no NoData defined, an error will be thrown.
In this particular example, the minimum value of all cells in all tiles in the column is greater than zero, so we can use 0 as the NoData value. We will construct a new `CellType` object to represent this.
```python, pick_nd
blue_min = unmasked.agg(rf_agg_stats('blue').min.alias('blue_min'))
print('Nonzero minimum value in the blue band:', blue_min.first())
blue_ct = unmasked.select(rf_cell_type('blue')).distinct().first()[0][0]
masked_blue_ct = CellType(blue_ct).with_no_data_value(0)
masked_blue_ct.cell_type_name
```
We next convert the blue band to this cell type.
```python, convert_blue
converted = unmasked.select('scl', 'green', rf_convert_cell_type('blue', masked_blue_ct).alias('blue'))
```
### Apply Mask from Quality Band
Now we set cells of our `blue` column to NoData for all locations where the `scl` tile is in our set of undesirable values. This is the actual _masking_ operation.
```python, apply_mask_blue
from pyspark.sql.functions import lit
masked = converted.withColumn('blue_masked', rf_mask_by_values('blue', 'scl', [0, 1, 8, 9, 10]))
masked
```
We can verify that the number of NoData cells in the resulting `blue_masked` column matches the total of the boolean `mask` _tile_ to ensure our logic is correct.
```python, show_masked_counts
masked.select(rf_no_data_cells('blue_masked'), rf_tile_sum(rf_local_is_in('scl', [0, 1, 8, 9, 10])))
```
It's also nice to view a sample. The white regions are areas of NoData.
```python, display_blu, caption='Blue band masked against selected SCL values'
sample = masked.orderBy(-rf_no_data_cells('blue_masked')).select(rf_tile('blue_masked'), rf_tile('scl')).first()
display(sample[0])
```
And the original SCL data. The bright yellow is a cloudy region in the original image.
```python, display_scl, caption='SCL tile for above'
display(sample[1])
```
### Transferring Mask
We can now apply the same mask from the blue column to the green column. Note here we have supressed the step of explicitly checking what a "safe" NoData value for the green band should be.
```python, mask_green
masked.withColumn('green_masked', rf_mask(rf_convert_cell_type('green', masked_blue_ct), 'blue_masked')) \
.orderBy(-rf_no_data_cells('blue_masked'))
```
## Masking Landsat 8
We will work with the Landsat scene [here](https://landsat-pds.s3.us-west-2.amazonaws.com/c1/L8/153/075/LC08_L1TP_153075_20190718_20190731_01_T1/index.html). For simplicity, we will just use two of the seven 30m bands. The quality mask for all bands is all contained in the `BQA` band.
```python, build_l8_df
base_url = 'https://landsat-pds.s3.us-west-2.amazonaws.com/c1/L8/153/075/LC08_L1TP_153075_20190718_20190731_01_T1/LC08_L1TP_153075_20190718_20190731_01_T1_'
data4 = base_url + 'B4.TIF'
data2 = base_url + 'B2.TIF'
mask = base_url + 'BQA.TIF'
l8_df = spark.read.raster([[data4, data2, mask]]) \
.withColumnRenamed('proj_raster_0', 'data') \
.withColumnRenamed('proj_raster_1', 'data2') \
.withColumnRenamed('proj_raster_2', 'mask')
```
Masking is described [on the Landsat Missions page](https://www.usgs.gov/land-resources/nli/landsat/landsat-collection-1-level-1-quality-assessment-band). It is pretty dense. Focus for this data set is the Collection 1 Level-1 for Landsat 8.
There are several inter-related factors to consider. In this exercise we will mask away the following.
* Designated Fill = yes
* Cloud = yes
* Cloud Shadow Confidence = Medium or High
* Cirrus Confidence = Medium or High
Note that you should consider your application and do your own exploratory analysis to determine the most appropriate mask!
According to the information on the Landsat site this translates to masking by bit values in the BQA according to the following table.
| Description | Value | Bits | Bit values |
|-------------------- |---------- |------- |---------------- |
| Designated fill | yes | 0 | 1 |
| Cloud | yes | 4 | 1 |
| Cloud shadow conf. | med / hi | 7-8 | 10, 11 (2, 3) |
| Cirrus conf. | med / hi | 11-12 | 10, 11 (2, 3) |
In this case, we will use the value of 0 as the NoData in the band data. Inspecting the associated [MTL txt file](https://landsat-pds.s3.us-west-2.amazonaws.com/c1/L8/153/075/LC08_L1TP_153075_20190718_20190731_01_T1/LC08_L1TP_153075_20190718_20190731_01_T1_MTL.txt) By inspection, we can discover that the minimum value in the band will be 1, thus allowing our use of 0 for NoData.
The code chunk below works through each of the rows in the table above. The first expression sets the cell type to have the selected NoData. The @ref:[`rf_mask_by_bit`](reference.md#rf-mask-by-bit) and @ref:[`rf_mask_by_bits`](reference.md#rf-mask-by-bits) functions extract the selected bit or bits from the `mask` cells and compare them to the provided values.
```python, build_l8_mask
l8_df = l8_df.withColumn('data_masked', # set to cell type that has a nodata
rf_convert_cell_type('data', CellType.uint16())) \
.withColumn('data_masked', # fill yes
rf_mask_by_bit('data_masked', 'mask', 0, 1)) \
.withColumn('data_masked', # cloud yes
rf_mask_by_bit('data_masked', 'mask', 4, 1)) \
.withColumn('data_masked', # cloud shadow conf is medium or high
rf_mask_by_bits('data_masked', 'mask', 7, 2, [2, 3])) \
.withColumn('data_masked', # cloud shadow conf is medium or high
rf_mask_by_bits('data_masked', 'mask', 11, 2, [2, 3])) \
.withColumn('data2', # mask other data col against the other band
rf_mask(rf_convert_cell_type('data2',CellType.uint16()), 'data_masked')) \
.filter(rf_data_cells('data_masked') > 0) # remove any entirely ND rows
# Inspect a sample
l8_df.select('data', 'mask', 'data_masked', 'data2', rf_extent('data_masked')) \
.filter(rf_data_cells('data_masked') > 32000)
```
## Clipping
Clipping is the use of a polygon to determine the areas to mask in a raster. Typically the areas inside a polygon are retained and the cells outside are set to NoData. Given a geometry column on our DataFrame, we have to carry out three basic steps. First we have to ensure the vector geometry is correctly projected to the same @ref:[CRS](concepts.md#coordinate-reference-system-crs) as the raster. We'll continue with our Sentinel 2 example, creating a simple polygon. Buffering a point will create an approximate circle.
```python, reproject_geom
to_rasterize = masked.withColumn('geom_4326',
st_bufferPoint(
st_point(lit(-78.0783132), lit(38.3184340)),
lit(15000))) \
.withColumn('geom_native', st_reproject('geom_4326', rf_mk_crs('epsg:4326'), rf_crs('blue_masked')))
```
Second, we will rasterize the geometry, or burn-in the geometry into the same grid as the raster.
```python, rasterize
to_clip = to_rasterize.withColumn('clip_raster',
rf_rasterize('geom_native', rf_geometry('blue_masked'), lit(1), rf_dimensions('blue_masked').cols, rf_dimensions('blue_masked').rows))
# visualize some of the edges of our circle
to_clip.select('blue_masked', 'clip_raster') \
.filter(rf_data_cells('clip_raster') > 20) \
.orderBy(rf_data_cells('clip_raster'))
```
Finally, we create a new _tile_ column with the blue band clipped to our circle. Again we will use the `rf_mask` function to pass the NoData regions along from the rasterized geometry.
```python, clip
to_clip.select('blue_masked',
'clip_raster',
rf_mask('blue_masked', 'clip_raster').alias('blue_clipped')) \
.filter(rf_data_cells('clip_raster') > 20) \
.orderBy(rf_data_cells('clip_raster'))
```
This kind of clipping technique is further used in @ref:[zonal statistics](zonal-algebra.md).