# Aggregation
```python, setup, echo=False
from pyrasterframes import rf_ipython
from pyrasterframes.utils import create_rf_spark_session
from pyrasterframes.rasterfunctions import *
from pyspark.sql import *
import os
import numpy as np
np.set_printoptions(precision=3, floatmode='maxprec')
spark = create_rf_spark_session()
```
There are three types of aggregate functions available in RasterFrames: _tile_ aggregate, DataFrame aggregate, and element-wise local aggregate. In the @ref:[tile aggregate functions](reference.md#tile-statistics), we are computing a statistical summary per row of a _tile_ column in a DataFrame. In the @ref:[DataFrame aggregate functions](reference.md#aggregate-tile-statistics), we are computing statistical summaries over all of the cell values *and* across all of the rows in the DataFrame or group. In the @ref:[element-wise local aggregate functions](reference.md#tile-local-aggregate-statistics), we are computing the element-wise statistical summary across a DataFrame or group of _tiles_. In the latter two cases, when @ref:[vector data](vector-data.md) is the grouping column, the results are @ref:[zonal statistics](zonal-algebra.md).
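The distinction between the three aggregation types can be sketched with plain NumPy (illustrative only, not the RasterFrames API), using two small 2x2 "tiles":

```python
import numpy as np

t_a = np.array([[1., 2.], [3., 4.]])
t_b = np.array([[5., 6.], [7., 8.]])

# Tile aggregate: one statistic per tile (i.e., per DataFrame row)
tile_means = [t_a.mean(), t_b.mean()]     # [2.5, 6.5]

# DataFrame aggregate: one statistic over all cells in all tiles
df_mean = float(np.mean([t_a, t_b]))      # 4.5

# Element-wise local aggregate: a tile of statistics, same dimensions as the inputs
local_mean = np.mean([t_a, t_b], axis=0)  # [[3., 4.], [5., 6.]]
```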
## Tile Mean Example
We can illustrate the differences among these aggregation types by computing an aggregate mean. First, we create a sample DataFrame of 2 _tiles_. The _tiles_ will contain normally distributed cell values with the first row's mean at 1.0 and the second row's mean at 5.0. For details on use of the `Tile` class see @ref:[the page on numpy interoperability](numpy-pandas.md).
```python, create_tile1
from pyrasterframes.rf_types import Tile, CellType
t1 = Tile(1 + 0.1 * np.random.randn(5,5), CellType('float64raw'))
t1.cells # display the array in the Tile
```
```python, showt5
t5 = Tile(5 + 0.1 * np.random.randn(5,5), CellType('float64raw'))
t5.cells
```
Create a Spark DataFrame from the Tile objects.
```python, create_dataframe
import pyspark.sql.functions as F
from pyspark.sql import Row
rf = spark.createDataFrame([
    Row(id=1, tile=t1),
    Row(id=2, tile=t5)
]).orderBy('id')
```
We use the @ref:[`rf_tile_mean`](reference.md#rf-tile-mean) function to compute the _tile_ aggregate mean of cells in each row of column `tile`. The mean of each _tile_ is computed separately, so the first mean is about 1.0 and the second mean is about 5.0. Notice that the number of rows in the DataFrame is the same before and after the aggregation.
```python, tile_mean
rf.select(F.col('id'), rf_tile_mean(F.col('tile')))
```
We use the @ref:[`rf_agg_mean`](reference.md#rf-agg-mean) function to compute the DataFrame aggregate, which averages values across the fifty cells in two rows. Note that only a single row is returned since the average is computed over the full DataFrame.
```python, agg_mean
rf.agg(rf_agg_mean(F.col('tile')))
```
We use the @ref:[`rf_agg_local_mean`](reference.md#rf-agg-local-mean) function to compute the element-wise local aggregate mean across the two rows. For this aggregation, we are computing the mean of one value near 1.0 and one value near 5.0 to arrive at each element-wise mean, doing so twenty-five times, once for each position in the _tile_.
To compute an element-wise local aggregate, _tiles_ need to have the same dimensions. In this case, both _tiles_ have 5 rows and 5 columns. If we tried to compute an element-wise local aggregate over the DataFrame without equal _tile_ dimensions, we would get a runtime error.
```python, local_mean
rf.agg(rf_agg_local_mean('tile')) \
    .first()[0].cells.data  # display the contents of the Tile array
```
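The same element-wise result can be reproduced locally with NumPy by stacking the two cell arrays and averaging along the new axis. This sketch re-creates arrays with the same distributions as `t1` and `t5` above (a seeded generator is used here so the example is deterministic):

```python
import numpy as np

rng = np.random.default_rng(42)
a1 = 1 + 0.1 * rng.standard_normal((5, 5))  # stand-in for t1.cells
a5 = 5 + 0.1 * rng.standard_normal((5, 5))  # stand-in for t5.cells

# Element-wise local mean: average corresponding cells across the two tiles
local_mean = np.mean([a1, a5], axis=0)

# The result keeps the tile dimensions, with every cell near 3.0
assert local_mean.shape == (5, 5)
assert np.allclose(local_mean, (a1 + a5) / 2)
```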
## Cell Counts Example
We can also count the total number of data and NoData cells over all the _tiles_ in a DataFrame using @ref:[`rf_agg_data_cells`](reference.md#rf-agg-data-cells) and @ref:[`rf_agg_no_data_cells`](reference.md#rf-agg-no-data-cells). There are ~3.8 million data cells and ~1.9 million NoData cells in this DataFrame. See the section on @ref:["NoData" handling](nodata-handling.md) for additional discussion on handling missing data.
```python, cell_counts
rf = spark.read.raster('https://rasterframes.s3.amazonaws.com/samples/MCD43A4.006/11/05/2018233/MCD43A4.A2018233.h11v05.006.2018242035530_B02.TIF')
stats = rf.agg(rf_agg_data_cells('proj_raster'), rf_agg_no_data_cells('proj_raster'))
stats
```
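Conceptually, what these two aggregates count can be shown with a small NumPy array and a sentinel NoData value (a sketch, not the RasterFrames implementation, which derives the NoData value from the cell type):

```python
import numpy as np

nodata = -9999.0  # hypothetical sentinel NoData value
cells = np.array([[1.0, nodata, 3.0],
                  [nodata, 5.0, 6.0]])

# Count cells that hold data versus cells flagged as NoData
data_cells = int(np.count_nonzero(cells != nodata))     # 4
no_data_cells = int(np.count_nonzero(cells == nodata))  # 2
```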
## Statistical Summaries
The statistical summary functions return a summary of cell values: number of data cells, number of NoData cells, minimum, maximum, mean, and variance, which can be computed as a _tile_ aggregate, a DataFrame aggregate, or an element-wise local aggregate.
The @ref:[`rf_tile_stats`](reference.md#rf-tile-stats) function computes summary statistics separately for each row in a _tile_ column as shown below.
```python, tile_stats
rf = spark.read.raster('https://rasterframes.s3.amazonaws.com/samples/luray_snp/B02.tif')
stats = rf.select(rf_tile_stats('proj_raster').alias('stats'))
stats.printSchema()
```
```python, show_stats
stats.select('stats.min', 'stats.max', 'stats.mean', 'stats.variance')
```
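Each per-tile summary corresponds to ordinary reductions over that tile's cell array, which can be sketched in NumPy (note this uses NumPy's default population variance; the variance convention used by the RasterFrames functions may differ):

```python
import numpy as np

tile = np.array([[1., 2., 3.],
                 [4., 5., 6.]])

# One summary row per tile: the same fields rf_tile_stats reports
stats = {
    'data_cells': tile.size,
    'min': tile.min(),
    'max': tile.max(),
    'mean': tile.mean(),
    'variance': tile.var(),  # population variance over the cells
}
```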
The @ref:[`rf_agg_stats`](reference.md#rf-agg-stats) function aggregates over all of the _tiles_ in a DataFrame and returns a statistical summary of all cell values as shown below.
```python, agg_stats
stats = rf.agg(rf_agg_stats('proj_raster').alias('stats')) \
    .select('stats.min', 'stats.max', 'stats.mean', 'stats.variance')
stats
```
The @ref:[`rf_agg_local_stats`](reference.md#rf-agg-local-stats) function computes the element-wise local aggregate statistical summary as shown below. The DataFrame used in the previous two code blocks has unequal _tile_ dimensions, so a different DataFrame is used in this code block to avoid a runtime error.
```python, agg_local_stats
rf = spark.createDataFrame([
    Row(id=1, tile=t1),
    Row(id=3, tile=t1 * 3),
    Row(id=5, tile=t1 * 5)
]).agg(rf_agg_local_stats('tile').alias('stats'))
agg_local_stats = rf.select('stats.min', 'stats.max', 'stats.mean', 'stats.variance').collect()
for r in agg_local_stats:
    for stat in r.asDict():
        print(stat, ':\n', r[stat], '\n')
```
## Histogram
The @ref:[`rf_tile_histogram`](reference.md#rf-tile-histogram) function computes a count of cell values within each row of the _tile_ column and outputs a `bins` array with the schema shown below. In the graph below, we have plotted each bin's `value` on the x-axis and `count` on the y-axis for the _tile_ in the first row of the DataFrame.
```python, tile_histogram
import matplotlib.pyplot as plt
rf = spark.read.raster('https://rasterframes.s3.amazonaws.com/samples/MCD43A4.006/11/05/2018233/MCD43A4.A2018233.h11v05.006.2018242035530_B02.TIF')
hist_df = rf.select(rf_tile_histogram('proj_raster')['bins'].alias('bins'))
hist_df.printSchema()
bins_row = hist_df.first()
values = [int(bin['value']) for bin in bins_row.bins]
counts = [int(bin['count']) for bin in bins_row.bins]
plt.hist(values, weights=counts, bins=100)
plt.show()
```
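The `bins` structure, distinct cell values paired with their counts, is essentially what `np.unique` with `return_counts=True` produces for a single tile (a local sketch, not the distributed implementation):

```python
import numpy as np

cells = np.array([3, 1, 3, 2, 3, 1])  # toy cell values
values, counts = np.unique(cells, return_counts=True)

# Pair each distinct value with its count, like one row's histogram bins
bins = list(zip(values.tolist(), counts.tolist()))
# bins -> [(1, 2), (2, 1), (3, 3)]
```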
The @ref:[`rf_agg_approx_histogram`](reference.md#rf-agg-approx-histogram) function computes a count of cell values across all of the rows of _tile_ in a DataFrame or group. In the example below, the range of the y-axis is significantly wider than the range of the y-axis on the previous histogram since this histogram was computed for all cell values in the DataFrame.
```python, agg_histogram
bins_list = rf.agg(
    rf_agg_approx_histogram('proj_raster')['bins'].alias('bins')
).collect()
values = [int(row['value']) for row in bins_list[0].bins]
counts = [int(row['count']) for row in bins_list[0].bins]
plt.hist(values, weights=counts, bins=100)
plt.show()
```