# Aggregation
```python, setup, echo=False
from pyrasterframes import rf_ipython
from pyrasterframes.utils import create_rf_spark_session
from pyrasterframes.rasterfunctions import *
from pyspark.sql import *
import os
import numpy as np
np.set_printoptions(precision=3, floatmode='maxprec')
spark = create_rf_spark_session()
```
There are three types of aggregate functions available in RasterFrames: _tile_ aggregate, DataFrame aggregate, and element-wise local aggregate. In the @ref:[tile aggregate functions](reference.md#tile-statistics), we are computing a statistical summary per row of a _tile_ column in a DataFrame. In the @ref:[DataFrame aggregate functions](reference.md#aggregate-tile-statistics), we are computing statistical summaries over all of the cell values *and* across all of the rows in the DataFrame or group. In the @ref:[element-wise local aggregate functions](reference.md#tile-local-aggregate-statistics), we are computing the element-wise statistical summary across a DataFrame or group of _tiles_. In the latter two cases, when @ref:[vector data](vector-data.md) is the grouping column, the results are @ref:[zonal statistics](zonal-algebra.md).
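The distinction between the three aggregation types can be sketched with plain NumPy (illustrative only, not the RasterFrames API), using two small 2x2 "tiles":

```python
import numpy as np

t_a = np.array([[1., 2.], [3., 4.]])
t_b = np.array([[5., 6.], [7., 8.]])

# Tile aggregate: one statistic per tile (i.e., per DataFrame row)
tile_means = [t_a.mean(), t_b.mean()]     # [2.5, 6.5]

# DataFrame aggregate: one statistic over all cells in all tiles
df_mean = float(np.mean([t_a, t_b]))      # 4.5

# Element-wise local aggregate: a tile of statistics, same dimensions as the inputs
local_mean = np.mean([t_a, t_b], axis=0)  # [[3., 4.], [5., 6.]]
```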
## Tile Mean Example
We can illustrate the differences among these aggregation types by computing an aggregate mean. First, we create a sample DataFrame of 2 _tiles_. The _tiles_ will contain normally distributed cell values with the first row's mean at 1.0 and the second row's mean at 5.0. For details on use of the `Tile` class see @ref:[the page on numpy interoperability](numpy-pandas.md).
```python, create_tile1
from pyrasterframes.rf_types import Tile, CellType
t1 = Tile(1 + 0.1 * np.random.randn(5,5), CellType('float64raw'))
t1.cells # display the array in the Tile
```
```python, showt5
t5 = Tile(5 + 0.1 * np.random.randn(5,5), CellType('float64raw'))
t5.cells
```
Create a Spark DataFrame from the Tile objects.
```python, create_dataframe
import pyspark.sql.functions as F
from pyspark.sql import Row
rf = spark.createDataFrame([
    Row(id=1, tile=t1),
    Row(id=2, tile=t5)
]).orderBy('id')
```
We use the @ref:[`rf_tile_mean`](reference.md#rf-tile-mean) function to compute the _tile_ aggregate mean of cells in each row of column `tile`. The mean of each _tile_ is computed separately, so the first mean is about 1.0 and the second mean is about 5.0. Notice that the number of rows in the DataFrame is the same before and after the aggregation.
```python, tile_mean
rf.select(F.col('id'), rf_tile_mean(F.col('tile')))
```
We use the @ref:[`rf_agg_mean`](reference.md#rf-agg-mean) function to compute the DataFrame aggregate, which averages values across the fifty cells in two rows. Note that only a single row is returned since the average is computed over the full DataFrame.
```python, agg_mean
rf.agg(rf_agg_mean(F.col('tile')))
```
We use the @ref:[`rf_agg_local_mean`](reference.md#rf-agg-local-mean) function to compute the element-wise local aggregate mean across the two rows. For this aggregation, we are computing the mean of one value near 1.0 and one value near 5.0 to arrive at each element-wise mean, doing so twenty-five times, once for each position in the _tile_.
To compute an element-wise local aggregate, _tiles_ need to have the same dimensions. In this case, both _tiles_ have 5 rows and 5 columns. If we tried to compute an element-wise local aggregate over the DataFrame without equal _tile_ dimensions, we would get a runtime error.
```python, local_mean
rf.agg(rf_agg_local_mean('tile')) \
    .first()[0].cells.data  # display the contents of the Tile array
```
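The same element-wise result can be reproduced locally with NumPy by stacking the two cell arrays and averaging along the new axis. This sketch re-creates arrays with the same distributions as `t1` and `t5` above (a seeded generator is used here so the example is deterministic):

```python
import numpy as np

rng = np.random.default_rng(42)
a1 = 1 + 0.1 * rng.standard_normal((5, 5))  # stand-in for t1.cells
a5 = 5 + 0.1 * rng.standard_normal((5, 5))  # stand-in for t5.cells

# Element-wise local mean: average corresponding cells across the two tiles
local_mean = np.mean([a1, a5], axis=0)

# The result keeps the tile dimensions, with every cell near 3.0
assert local_mean.shape == (5, 5)
assert np.allclose(local_mean, (a1 + a5) / 2)
```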
## Cell Counts Example
We can also count the total number of data and NoData cells over all the _tiles_ in a DataFrame using @ref:[`rf_agg_data_cells`](reference.md#rf-agg-data-cells) and @ref:[`rf_agg_no_data_cells`](reference.md#rf-agg-no-data-cells). There are ~3.8 million data cells and ~1.9 million NoData cells in this DataFrame. See the section on @ref:["NoData" handling](nodata-handling.md) for additional discussion on handling missing data.
```python, cell_counts
rf = spark.read.raster('https://rasterframes.s3.amazonaws.com/samples/MCD43A4.006/11/05/2018233/MCD43A4.A2018233.h11v05.006.2018242035530_B02.TIF')
stats = rf.agg(rf_agg_data_cells('proj_raster'), rf_agg_no_data_cells('proj_raster'))
stats
```
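Conceptually, what these two aggregates count can be shown with a small NumPy array and a sentinel NoData value (a sketch, not the RasterFrames implementation, which derives the NoData value from the cell type):

```python
import numpy as np

nodata = -9999.0  # hypothetical sentinel NoData value
cells = np.array([[1.0, nodata, 3.0],
                  [nodata, 5.0, 6.0]])

# Count cells that hold data versus cells flagged as NoData
data_cells = int(np.count_nonzero(cells != nodata))     # 4
no_data_cells = int(np.count_nonzero(cells == nodata))  # 2
```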
## Statistical Summaries
The statistical summary functions return a summary of cell values: number of data cells, number of NoData cells, minimum, maximum, mean, and variance, which can be computed as a _tile_ aggregate, a DataFrame aggregate, or an element-wise local aggregate.
The @ref:[`rf_tile_stats`](reference.md#rf-tile-stats) function computes summary statistics separately for each row in a _tile_ column as shown below.
```python, tile_stats
rf = spark.read.raster('https://rasterframes.s3.amazonaws.com/samples/luray_snp/B02.tif')
stats = rf.select(rf_tile_stats('proj_raster').alias('stats'))
stats.printSchema()
```
```python, show_stats
stats.select('stats.min', 'stats.max', 'stats.mean', 'stats.variance')
```
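Each per-tile summary corresponds to ordinary reductions over that tile's cell array, which can be sketched in NumPy (note this uses NumPy's default population variance; the variance convention used by the RasterFrames functions may differ):

```python
import numpy as np

tile = np.array([[1., 2., 3.],
                 [4., 5., 6.]])

# One summary row per tile: the same fields rf_tile_stats reports
stats = {
    'data_cells': tile.size,
    'min': tile.min(),
    'max': tile.max(),
    'mean': tile.mean(),
    'variance': tile.var(),  # population variance over the cells
}
```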
The @ref:[`rf_agg_stats`](reference.md#rf-agg-stats) function aggregates over all of the _tiles_ in a DataFrame and returns a statistical summary of all cell values as shown below.
```python, agg_stats
stats = rf.agg(rf_agg_stats('proj_raster').alias('stats')) \
    .select('stats.min', 'stats.max', 'stats.mean', 'stats.variance')
stats
```
The @ref:[`rf_agg_local_stats`](reference.md#rf-agg-local-stats) function computes the element-wise local aggregate statistical summary as shown below. The DataFrame used in the previous two code blocks has unequal _tile_ dimensions, so a different DataFrame is used in this code block to avoid a runtime error.
```python, agg_local_stats
rf = spark.createDataFrame([
    Row(id=1, tile=t1),
    Row(id=3, tile=t1 * 3),
    Row(id=5, tile=t1 * 5)
]).agg(rf_agg_local_stats('tile').alias('stats'))
agg_local_stats = rf.select('stats.min', 'stats.max', 'stats.mean', 'stats.variance').collect()
for r in agg_local_stats:
    for stat in r.asDict():
        print(stat, ':\n', r[stat], '\n')
```
## Histogram
The @ref:[`rf_tile_histogram`](reference.md#rf-tile-histogram) function computes a count of cell values within each row of the _tile_ column and outputs a `bins` array with the schema shown below. In the graph below, we have plotted each bin's `value` on the x-axis and `count` on the y-axis for the _tile_ in the first row of the DataFrame.
```python, tile_histogram
import matplotlib.pyplot as plt
rf = spark.read.raster('https://rasterframes.s3.amazonaws.com/samples/MCD43A4.006/11/05/2018233/MCD43A4.A2018233.h11v05.006.2018242035530_B02.TIF')
hist_df = rf.select(rf_tile_histogram('proj_raster')['bins'].alias('bins'))
hist_df.printSchema()
bins_row = hist_df.first()
values = [int(bin['value']) for bin in bins_row.bins]
counts = [int(bin['count']) for bin in bins_row.bins]
plt.hist(values, weights=counts, bins=100)
plt.show()
```
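The `bins` structure, distinct cell values paired with their counts, is essentially what `np.unique` with `return_counts=True` produces for a single tile (a local sketch, not the distributed implementation):

```python
import numpy as np

cells = np.array([3, 1, 3, 2, 3, 1])  # toy cell values
values, counts = np.unique(cells, return_counts=True)

# Pair each distinct value with its count, like one row's histogram bins
bins = list(zip(values.tolist(), counts.tolist()))
# bins -> [(1, 2), (2, 1), (3, 3)]
```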
The @ref:[`rf_agg_approx_histogram`](reference.md#rf-agg-approx-histogram) function computes a count of cell values across all of the rows of _tile_ in a DataFrame or group. In the example below, the range of the y-axis is significantly wider than the range of the y-axis on the previous histogram since this histogram was computed for all cell values in the DataFrame.
```python, agg_histogram
bins_list = rf.agg(
    rf_agg_approx_histogram('proj_raster')['bins'].alias('bins')
).collect()
values = [int(row['value']) for row in bins_list[0].bins]
counts = [int(row['count']) for row in bins_list[0].bins]
plt.hist(values, weights=counts, bins=100)
plt.show()
```