<!DOCTYPE html>
<!-- Generated by pkgdown: do not edit by hand --><html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Writing performance code with RcppCWB • RcppCWB</title>
<!-- jquery --><script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.4.1/jquery.min.js" integrity="sha256-CSXorXvZcTkaix6Yvo6HppcZGetbYMGWSFlBw8HfCJo=" crossorigin="anonymous"></script><!-- Bootstrap --><link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/twitter-bootstrap/3.4.1/css/bootstrap.min.css" integrity="sha256-bZLfwXAP04zRMK2BjiO8iu9pf4FbLqX6zitd+tIvLhE=" crossorigin="anonymous">
<script src="https://cdnjs.cloudflare.com/ajax/libs/twitter-bootstrap/3.4.1/js/bootstrap.min.js" integrity="sha256-nuL8/2cJ5NDSSwnKD8VqreErSWHtnEP9E7AySL+1ev4=" crossorigin="anonymous"></script><!-- bootstrap-toc --><link rel="stylesheet" href="../bootstrap-toc.css">
<script src="../bootstrap-toc.js"></script><!-- Font Awesome icons --><link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.12.1/css/all.min.css" integrity="sha256-mmgLkCYLUQbXn0B1SRqzHar6dCnv9oZFPEC1g1cwlkk=" crossorigin="anonymous">
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.12.1/css/v4-shims.min.css" integrity="sha256-wZjR52fzng1pJHwx4aV2AO3yyTOXrcDW7jBpJtTwVxw=" crossorigin="anonymous">
<!-- clipboard.js --><script src="https://cdnjs.cloudflare.com/ajax/libs/clipboard.js/2.0.6/clipboard.min.js" integrity="sha256-inc5kl9MA1hkeYUt+EC3BhlIgyp/2jDIyBLS6k3UxPI=" crossorigin="anonymous"></script><!-- headroom.js --><script src="https://cdnjs.cloudflare.com/ajax/libs/headroom/0.11.0/headroom.min.js" integrity="sha256-AsUX4SJE1+yuDu5+mAVzJbuYNPHj/WroHuZ8Ir/CkE0=" crossorigin="anonymous"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/headroom/0.11.0/jQuery.headroom.min.js" integrity="sha256-ZX/yNShbjqsohH1k95liqY9Gd8uOiE1S4vZc+9KQ1K4=" crossorigin="anonymous"></script><!-- pkgdown --><link href="../pkgdown.css" rel="stylesheet">
<script src="../pkgdown.js"></script><meta property="og:title" content="Writing performance code with RcppCWB">
<meta property="og:description" content="RcppCWB">
<!-- mathjax --><script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/MathJax.js" integrity="sha256-nvJJv9wWKEm88qvoQl9ekL2J+k/RWIsaSScxxlsrv8k=" crossorigin="anonymous"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/config/TeX-AMS-MML_HTMLorMML.js" integrity="sha256-84DKXVJXs0/F8OTMzX4UR909+jtl4G7SPypPavF+GfA=" crossorigin="anonymous"></script><!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/html5shiv/3.7.3/html5shiv.min.js"></script>
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
<![endif]-->
</head>
<body data-spy="scroll" data-target="#toc">
<div class="container template-article">
<header><div class="navbar navbar-default navbar-fixed-top" role="navigation">
<div class="container">
<div class="navbar-header">
<button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar" aria-expanded="false">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<span class="navbar-brand">
<a class="navbar-link" href="../index.html">RcppCWB</a>
<span class="version label label-default" data-toggle="tooltip" data-placement="bottom" title="">0.6.0</span>
</span>
</div>
<div id="navbar" class="navbar-collapse collapse">
<ul class="nav navbar-nav">
<li>
<a href="../reference/index.html">Reference</a>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" data-bs-toggle="dropdown" aria-expanded="false">
Articles
<span class="caret"></span>
</a>
<ul class="dropdown-menu" role="menu">
<li>
<a href="../articles/vignette.html">Writing performance code with RcppCWB</a>
</li>
</ul>
</li>
<li>
<a href="../news/index.html">Changelog</a>
</li>
</ul>
<ul class="nav navbar-nav navbar-right">
<li>
<a href="https://github.com/PolMine/RcppCWB/" class="external-link">
<span class="fab fa-github fa-lg"></span>
</a>
</li>
</ul>
</div>
<!--/.nav-collapse -->
</div>
<!--/.container -->
</div>
<!--/.navbar -->
</header><div class="row">
<div class="col-md-9 contents">
<div class="page-header toc-ignore">
<h1 data-toc-skip>Writing performance code with RcppCWB</h1>
<h4 data-toc-skip class="author">Andreas Blaette</h4>
<h4 data-toc-skip class="date">2023-03-23</h4>
<small class="dont-index">Source: <a href="https://github.com/PolMine/RcppCWB/blob/HEAD/vignettes/vignette.Rmd" class="external-link"><code>vignettes/vignette.Rmd</code></a></small>
<div class="hidden name"><code>vignette.Rmd</code></div>
</div>
<div class="section level2">
<h2 id="rationale">Rationale<a class="anchor" aria-label="anchor" href="#rationale"></a>
</h2>
<p>The RcppCWB package exposes the functionality of the Corpus Workbench
(CWB) to R, so that R users can benefit from the performance of the C
code of the CWB. Ease of use and performance should be great most of the
time. But there are scenarios when the interface between R and C/C++ is
a bottleneck for achieving sufficient performance. In this case, using
CWB functionality in C++ functions exposed to R using
<code><a href="https://rdrr.io/pkg/Rcpp/man/cppFunction.html" class="external-link">Rcpp::cppFunction()</a></code> or <code><a href="https://rdrr.io/pkg/Rcpp/man/sourceCpp.html" class="external-link">Rcpp::sourceCpp()</a></code> may
solve issues with performance and memory limitations.</p>
</div>
<div class="section level2">
<h2 id="basics">Basics<a class="anchor" aria-label="anchor" href="#basics"></a>
</h2>
<p>Writing C++ functions that use CWB functionality requires loading the
Rcpp and RcppCWB packages.</p>
<div class="sourceCode" id="cb1"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="kw"><a href="https://rdrr.io/r/base/library.html" class="external-link">library</a></span><span class="op">(</span><span class="va"><a href="https://www.rcpp.org" class="external-link">Rcpp</a></span><span class="op">)</span></span>
<span><span class="kw"><a href="https://rdrr.io/r/base/library.html" class="external-link">library</a></span><span class="op">(</span><span class="va"><a href="https://github.com/PolMine/RcppCWB" class="external-link">RcppCWB</a></span><span class="op">)</span></span></code></pre></div>
<p>Note that the default functions for accessing CWB functionality take
length-one character vectors and use them to look up the C
representation of a structural or positional attribute of a loaded
corpus. This lookup is repeated on every call, so it is more efficient
to perform it only once. Following this rationale, a second set of
functions exposes CWB functionality closer to the C logic: these
functions take pointers to attributes that have already been looked up.</p>
<p>This functionality can also be used from R. For instance, we look up
the p-attribute “word” of the “REUTERS” corpus as follows.</p>
<div class="sourceCode" id="cb2"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">p_attr_word</span> <span class="op"><-</span> <span class="fu"><a href="../reference/cl_rework.html">p_attr</a></span><span class="op">(</span></span>
<span> corpus <span class="op">=</span> <span class="st">"REUTERS"</span>,</span>
<span> p_attribute <span class="op">=</span> <span class="st">"word"</span>,</span>
<span> registry <span class="op">=</span> <span class="fu"><a href="../reference/tmp_registry.html">get_tmp_registry</a></span><span class="op">(</span><span class="op">)</span></span>
<span><span class="op">)</span></span></code></pre></div>
<p>And we use <code><a href="../reference/cl_rework.html">cpos_to_str()</a></code> to decode the first words of
the corpus.</p>
<div class="sourceCode" id="cb3"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="fu"><a href="../reference/cl_rework.html">cpos_to_str</a></span><span class="op">(</span><span class="va">p_attr_word</span>, <span class="fl">0</span><span class="op">:</span><span class="fl">10</span><span class="op">)</span></span></code></pre></div>
<pre><code><span><span class="co">## [1] "Diamond" "Shamrock" "Corp" "said" "that" "effective"</span></span>
<span><span class="co">## [7] "today" "it" "had" "cut" "its"</span></span></code></pre>
<p>While this may also be useful when writing R code, this lower-level
functionality is particularly well-suited for writing high-performance
C++ code exposed to R.</p>
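<p>The gain from looking up an attribute only once can be illustrated
without CWB. The following standalone C++ sketch is an analogy, not
RcppCWB code: <code>Registry</code>, <code>Attribute</code>,
<code>token_by_name()</code>, <code>token_by_handle()</code> and
<code>make_registry()</code> are hypothetical names invented for this
illustration. The name-based accessor resolves the attribute on every
call, whereas the handle-based accessor reuses a pointer obtained
earlier.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a positional attribute (not the CWB struct).
struct Attribute {
  std::vector<std::string> tokens;
};

// Hypothetical registry keyed by (corpus, attribute name).
// It counts lookups so that the cost of repeated resolution is visible.
struct Registry {
  std::map<std::pair<std::string, std::string>, Attribute> attributes;
  int lookups = 0;

  // Name-based lookup: the step worth performing only once.
  const Attribute* find(const std::string& corpus, const std::string& name) {
    ++lookups;
    auto it = attributes.find(std::make_pair(corpus, name));
    return it == attributes.end() ? nullptr : &it->second;
  }
};

// Name-based access: resolves the attribute again on every call.
std::string token_by_name(Registry& reg, const std::string& corpus,
                          const std::string& name, std::size_t cpos) {
  const Attribute* attr = reg.find(corpus, name);
  if (attr == nullptr) throw std::invalid_argument("unknown attribute: " + name);
  return attr->tokens.at(cpos);
}

// Handle-based access: the caller passes a pointer looked up beforehand.
std::string token_by_handle(const Attribute* attr, std::size_t cpos) {
  assert(attr != nullptr);
  return attr->tokens.at(cpos);
}

// Toy registry holding the first four tokens of the example corpus.
Registry make_registry() {
  Registry reg;
  reg.attributes[std::make_pair(std::string("REUTERS"), std::string("word"))] =
      Attribute{{"Diamond", "Shamrock", "Corp", "said"}};
  return reg;
}
```

<p>In this sketch, decoding <em>n</em> tokens by name increments
<code>lookups</code> <em>n</em> times, while calling
<code>find()</code> once and passing the result to
<code>token_by_handle()</code> increments it only once, regardless of
how many tokens are decoded.</p>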
</div>
<div class="section level2">
<h2 id="inline-c-functions">Inline C++ functions<a class="anchor" aria-label="anchor" href="#inline-c-functions"></a>
</h2>
<p>We start with a simple first scenario, which uses
<code><a href="https://rdrr.io/pkg/Rcpp/man/cppFunction.html" class="external-link">cppFunction()</a></code> to compile an inline C++ function and make it
available in an R session.</p>
<div class="sourceCode" id="cb5"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="fu"><a href="https://rdrr.io/pkg/Rcpp/man/cppFunction.html" class="external-link">cppFunction</a></span><span class="op">(</span></span>
<span> <span class="st">'Rcpp::StringVector get_str(SEXP corpus, SEXP p_attribute, SEXP registry, Rcpp::IntegerVector cpos){</span></span>
<span><span class="st"> SEXP attr;</span></span>
<span><span class="st"> Rcpp::StringVector result;</span></span>
<span><span class="st"> attr = RcppCWB::p_attr(corpus, p_attribute, registry);</span></span>
<span><span class="st"> result = RcppCWB::cpos_to_str(attr, cpos);</span></span>
<span><span class="st"> return(result);</span></span>
<span><span class="st"> }'</span>,</span>
<span> depends <span class="op">=</span> <span class="st">"RcppCWB"</span></span>
<span><span class="op">)</span></span></code></pre></div>
<p>This is not a very interesting example, but using the function
works:</p>
<div class="sourceCode" id="cb6"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="fu">get_str</span><span class="op">(</span><span class="st">"REUTERS"</span>, <span class="st">"word"</span>, <span class="fu">RcppCWB</span><span class="fu">::</span><span class="fu"><a href="../reference/tmp_registry.html">get_tmp_registry</a></span><span class="op">(</span><span class="op">)</span>, <span class="fl">0</span><span class="op">:</span><span class="fl">50</span><span class="op">)</span></span></code></pre></div>
<pre><code><span><span class="co">## [1] "Diamond" "Shamrock" "Corp" "said" "that" </span></span>
<span><span class="co">## [6] "effective" "today" "it" "had" "cut" </span></span>
<span><span class="co">## [11] "its" "contract" "prices" "for" "crude" </span></span>
<span><span class="co">## [16] "oil" "by" "1.50" "dlrs" "a" </span></span>
<span><span class="co">## [21] "barrel" "The" "reduction" "brings" "its" </span></span>
<span><span class="co">## [26] "posted" "price" "for" "West" "Texas" </span></span>
<span><span class="co">## [31] "Intermediate" "to" "16.00" "dlrs" "a" </span></span>
<span><span class="co">## [36] "barrel" "the" "copany" "said" "The" </span></span>
<span><span class="co">## [41] "price" "reduction" "today" "was" "made" </span></span>
<span><span class="co">## [46] "in" "the" "light" "of" "falling" </span></span>
<span><span class="co">## [51] "oil"</span></span></code></pre>
</div>
<div class="section level2">
<h2 id="source-c-function">Source C++ function<a class="anchor" aria-label="anchor" href="#source-c-function"></a>
</h2>
<p>To provide a more interesting real-life example, we demonstrate a
solution to the following scenario: It may be necessary to decode an
entire corpus, and to write the tokens of corpus regions to a file in a
line-by-line manner. Computing word embeddings may require this input
format, for instance.</p>
<p>But if the corpus is really large, decoding the corpus entirely and
then writing everything to disk may hit memory limitations. Decoding the
tokens of the corpus successively and writing content to the output file
on the spot is an obvious solution, but moving data across the R/C++/C
interface for every single token is excessively slow. A pure C++
implementation will be much more efficient.</p>
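<p>The streaming idea itself, decoding one region at a time and writing
it out immediately, can be sketched in plain C++ independently of CWB.
In this minimal sketch, <code>write_regions()</code> is a hypothetical
helper, the <code>decode</code> callback stands in for whatever turns a
corpus position into a token, and regions are assumed to be inclusive
(start, end) pairs of corpus positions. Only one token is held in memory
at any time.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <functional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Write one line per region, with tokens separated by single spaces.
// `decode` maps a corpus position to its token (a stand-in for a decoder);
// `regions` holds inclusive (start, end) corpus positions.
// Returns the number of lines written.
std::size_t write_regions(
    const std::function<std::string(std::size_t)>& decode,
    const std::vector<std::pair<std::size_t, std::size_t>>& regions,
    const std::string& filename) {
  std::ofstream out(filename);
  if (!out) {
    throw std::runtime_error("file could not be opened: " + filename);
  }

  std::size_t lines = 0;
  for (const auto& region : regions) {
    assert(region.first <= region.second);
    for (std::size_t cpos = region.first; cpos <= region.second; ++cpos) {
      if (cpos > region.first) out << ' ';  // separator between tokens only
      out << decode(cpos);                  // token is written and discarded
    }
    out << '\n';
    ++lines;
  }
  return lines;
}
```

<p>Because each token goes straight to the output stream, memory use
does not grow with the size of the corpus. Throwing an exception when
the file cannot be opened leaves it to the caller to decide how to
report the failure.</p>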
<p>The following C++ file that relies on CWB functions as exposed by
RcppCWB addresses the scenario.</p>
<div class="sourceCode" id="cb8"><pre class="sourceCode cpp"><code class="sourceCode cpp"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="co">// [[Rcpp::depends(RcppCWB)]]</span></span>
<span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im"><Rcpp.h></span></span>
<span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im"><RcppCWB.h></span></span>
<span id="cb8-4"><a href="#cb8-4" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb8-5"><a href="#cb8-5" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im"><stdio.h></span></span>
<span id="cb8-6"><a href="#cb8-6" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im"><iostream></span></span>
<span id="cb8-7"><a href="#cb8-7" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im"><fstream></span></span>
<span id="cb8-8"><a href="#cb8-8" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im"><cstdlib></span></span>
<span id="cb8-9"><a href="#cb8-9" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb8-10"><a href="#cb8-10" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb8-11"><a href="#cb8-11" aria-hidden="true" tabindex="-1"></a><span class="co">// [[Rcpp::export]]</span></span>
<span id="cb8-12"><a href="#cb8-12" aria-hidden="true" tabindex="-1"></a><span class="dt">int</span> write_token_stream<span class="op">(</span>SEXP corpus<span class="op">,</span> SEXP p_attribute<span class="op">,</span> SEXP <span class="va">s_attribute</span><span class="op">,</span> SEXP registry<span class="op">,</span> SEXP <span class="dt">attribute_type</span><span class="op">,</span> Rcpp<span class="op">::</span>StringVector filename<span class="op">)</span> <span class="op">{</span></span>
<span id="cb8-13"><a href="#cb8-13" aria-hidden="true" tabindex="-1"></a> </span>
<span id="cb8-14"><a href="#cb8-14" aria-hidden="true" tabindex="-1"></a> <span class="dt">int</span> i<span class="op">,</span> n<span class="op">,</span> region_size<span class="op">;</span></span>
<span id="cb8-15"><a href="#cb8-15" aria-hidden="true" tabindex="-1"></a> Rcpp<span class="op">::</span>IntegerVector region<span class="op">(</span><span class="dv">2</span><span class="op">);</span></span>
<span id="cb8-16"><a href="#cb8-16" aria-hidden="true" tabindex="-1"></a> <span class="bu">std::</span>ofstream<span class="op"> </span>outdata<span class="op">;</span></span>
<span id="cb8-17"><a href="#cb8-17" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb8-18"><a href="#cb8-18" aria-hidden="true" tabindex="-1"></a> n <span class="op">=</span> RcppCWB<span class="op">::</span>attribute_size<span class="op">(</span>corpus<span class="op">,</span> <span class="va">s_attribute</span><span class="op">,</span> <span class="dt">attribute_type</span><span class="op">,</span> registry<span class="op">);</span></span>
<span id="cb8-19"><a href="#cb8-19" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb8-20"><a href="#cb8-20" aria-hidden="true" tabindex="-1"></a> outdata<span class="op">.</span>open<span class="op">(</span>filename<span class="op">[</span><span class="dv">0</span><span class="op">]);</span></span>
<span id="cb8-21"><a href="#cb8-21" aria-hidden="true" tabindex="-1"></a> <span class="cf">if</span><span class="op">(</span> <span class="op">!</span>outdata <span class="op">)</span> <span class="op">{</span></span>
<span id="cb8-22"><a href="#cb8-22" aria-hidden="true" tabindex="-1"></a> <span class="co">// Raise an R error; calling exit() would terminate the whole R session</span></span>
<span id="cb8-23"><a href="#cb8-23" aria-hidden="true" tabindex="-1"></a> Rcpp<span class="op">::</span>stop<span class="op">(</span><span class="st">"Error: file could not be opened"</span><span class="op">);</span></span>
<span id="cb8-24"><a href="#cb8-24" aria-hidden="true" tabindex="-1"></a> <span class="op">}</span></span>
<span id="cb8-25"><a href="#cb8-25" aria-hidden="true" tabindex="-1"></a> </span>
<span id="cb8-26"><a href="#cb8-26" aria-hidden="true" tabindex="-1"></a> <span class="cf">for</span> <span class="op">(</span>i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op"><</span> n<span class="op">;</span> i<span class="op">++){</span></span>
<span id="cb8-27"><a href="#cb8-27" aria-hidden="true" tabindex="-1"></a> region <span class="op">=</span> RcppCWB<span class="op">::</span>struc2cpos<span class="op">(</span>corpus<span class="op">,</span> <span class="va">s_attribute</span><span class="op">,</span> registry<span class="op">,</span> i<span class="op">);</span></span>
<span id="cb8-28"><a href="#cb8-28" aria-hidden="true" tabindex="-1"></a> region_size <span class="op">=</span> region<span class="op">[</span><span class="dv">1</span><span class="op">]</span> <span class="op">-</span> region<span class="op">[</span><span class="dv">0</span><span class="op">]</span> <span class="op">+</span> <span class="dv">1</span><span class="op">;</span></span>
<span id="cb8-29"><a href="#cb8-29" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb8-30"><a href="#cb8-30" aria-hidden="true" tabindex="-1"></a> Rcpp<span class="op">::</span>IntegerVector cpos<span class="op">(</span>region_size<span class="op">);</span></span>
<span id="cb8-31"><a href="#cb8-31" aria-hidden="true" tabindex="-1"></a> cpos <span class="op">=</span> Rcpp<span class="op">::</span>seq<span class="op">(</span>region<span class="op">[</span><span class="dv">0</span><span class="op">],</span> region<span class="op">[</span><span class="dv">1</span><span class="op">]);</span></span>
<span id="cb8-32"><a href="#cb8-32" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb8-33"><a href="#cb8-33" aria-hidden="true" tabindex="-1"></a> Rcpp<span class="op">::</span>StringVector values<span class="op">(</span>region_size<span class="op">);</span></span>
<span id="cb8-34"><a href="#cb8-34" aria-hidden="true" tabindex="-1"></a> values <span class="op">=</span> RcppCWB<span class="op">::</span>cpos2str<span class="op">(</span>corpus<span class="op">,</span> p_attribute<span class="op">,</span> registry<span class="op">,</span> cpos<span class="op">);</span></span>
<span id="cb8-35"><a href="#cb8-35" aria-hidden="true" tabindex="-1"></a> </span>
<span id="cb8-36"><a href="#cb8-36" aria-hidden="true" tabindex="-1"></a> <span class="dt">int</span> j<span class="op">;</span></span>
<span id="cb8-37"><a href="#cb8-37" aria-hidden="true" tabindex="-1"></a> <span class="cf">for</span> <span class="op">(</span>j <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> j <span class="op"><</span> values<span class="op">.</span>length<span class="op">();</span> j<span class="op">++){</span> </span>
<span id="cb8-38"><a href="#cb8-38" aria-hidden="true" tabindex="-1"></a> outdata <span class="op"><<</span> values<span class="op">(</span>j<span class="op">);</span></span>
<span id="cb8-39"><a href="#cb8-39" aria-hidden="true" tabindex="-1"></a> <span class="cf">if</span> <span class="op">(</span>j <span class="op"><</span> values<span class="op">.</span>length<span class="op">()</span> <span class="op">-</span> <span class="dv">1</span><span class="op">){</span></span>
<span id="cb8-40"><a href="#cb8-40" aria-hidden="true" tabindex="-1"></a> outdata <span class="op"><<</span> <span class="st">" "</span><span class="op">;</span></span>
<span id="cb8-41"><a href="#cb8-41" aria-hidden="true" tabindex="-1"></a> <span class="op">}</span></span>
<span id="cb8-42"><a href="#cb8-42" aria-hidden="true" tabindex="-1"></a> <span class="op">}</span></span>
<span id="cb8-43"><a href="#cb8-43" aria-hidden="true" tabindex="-1"></a> outdata <span class="op"><<</span> <span class="bu">std::</span>endl<span class="op">;</span></span>
<span id="cb8-44"><a href="#cb8-44" aria-hidden="true" tabindex="-1"></a> <span class="op">}</span></span>
<span id="cb8-45"><a href="#cb8-45" aria-hidden="true" tabindex="-1"></a> outdata<span class="op">.</span>close<span class="op">();</span></span>
<span id="cb8-46"><a href="#cb8-46" aria-hidden="true" tabindex="-1"></a> </span>
<span id="cb8-47"><a href="#cb8-47" aria-hidden="true" tabindex="-1"></a> <span class="cf">return</span> <span class="dv">0</span><span class="op">;</span></span>
<span id="cb8-48"><a href="#cb8-48" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<p>This code can be sourced, compiled and exposed to R using
<code><a href="https://rdrr.io/pkg/Rcpp/man/sourceCpp.html" class="external-link">sourceCpp()</a></code>.</p>
<div class="sourceCode" id="cb9"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="fu"><a href="https://rdrr.io/pkg/Rcpp/man/sourceCpp.html" class="external-link">sourceCpp</a></span><span class="op">(</span>file <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/system.file.html" class="external-link">system.file</a></span><span class="op">(</span>package <span class="op">=</span> <span class="st">"RcppCWB"</span>, <span class="st">"cpp"</span>, <span class="st">"fastdecode.cpp"</span><span class="op">)</span><span class="op">)</span></span></code></pre></div>
<p>We exemplify that everything works as intended using the (smallish)
REUTERS corpus. So we create the output …</p>
<div class="sourceCode" id="cb10"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="va">outfile</span> <span class="op"><-</span> <span class="fu"><a href="https://rdrr.io/r/base/tempfile.html" class="external-link">tempfile</a></span><span class="op">(</span>fileext <span class="op">=</span> <span class="st">".txt"</span><span class="op">)</span></span>
<span></span>
<span><span class="fu">write_token_stream</span><span class="op">(</span></span>
<span> corpus <span class="op">=</span> <span class="st">"REUTERS"</span>,</span>
<span> p_attribute <span class="op">=</span> <span class="st">"word"</span>, </span>
<span> s_attribute <span class="op">=</span> <span class="st">"id"</span>,</span>
<span> attribute_type <span class="op">=</span> <span class="st">"s"</span>,</span>
<span> registry <span class="op">=</span> <span class="fu">RcppCWB</span><span class="fu">::</span><span class="fu"><a href="../reference/tmp_registry.html">get_tmp_registry</a></span><span class="op">(</span><span class="op">)</span>,</span>
<span> filename <span class="op">=</span> <span class="va">outfile</span></span>
<span><span class="op">)</span></span></code></pre></div>
<pre><code><span><span class="co">## [1] 0</span></span></code></pre>
<p>… and read it (showing the content selectively) to convey that the
corpus data has been exported as intended.</p>
<div class="sourceCode" id="cb12"><pre class="downlit sourceCode r">
<code class="sourceCode R"><span><span class="fu"><a href="https://rdrr.io/r/base/readLines.html" class="external-link">readLines</a></span><span class="op">(</span><span class="va">outfile</span><span class="op">)</span> <span class="op">|></span></span>
<span> <span class="fu"><a href="https://rdrr.io/r/base/lapply.html" class="external-link">lapply</a></span><span class="op">(</span><span class="va">substr</span>, <span class="fl">1</span>, <span class="fl">75</span><span class="op">)</span> <span class="op">|></span></span>
<span> <span class="fu"><a href="https://rdrr.io/r/base/unlist.html" class="external-link">unlist</a></span><span class="op">(</span><span class="op">)</span></span></code></pre></div>
<pre><code><span><span class="co">## [1] "Diamond Shamrock Corp said that effective today it had cut its contract pri"</span></span>
<span><span class="co">## [2] "OPEC may be forced to meet before a scheduled June session to readdress its"</span></span>
<span><span class="co">## [3] "Texaco Canada said it lowered the contract price it will pay for crude oil "</span></span>
<span><span class="co">## [4] "Marathon Petroleum Co said it reduced the contract price it will pay for al"</span></span>
<span><span class="co">## [5] "Houston Oil Trust said that independent petroleum engineers completed an an"</span></span>
<span><span class="co">## [6] "Kuwait s Oil Minister in remarks published today said there were no plans f"</span></span>
<span><span class="co">## [7] "Indonesia appears to be nearing a political crossroads over measures to der"</span></span>
<span><span class="co">## [8] "Saudi riyal interbank deposits were steady at yesterday's higher levels in "</span></span>
<span><span class="co">## [9] "The Gulf oil state of Qatar recovering slightly from last year's decline in"</span></span>
<span><span class="co">## [10] "Saudi Arabian Oil Minister Hisham Nazer reiterated the kingdom's commitment"</span></span>
<span><span class="co">## [11] "Saudi crude oil output last month fell to an average of 3.5 mln barrels per"</span></span>
<span><span class="co">## [12] "Deputy oil ministers from six Gulf Arab states will meet in Bahrain today t"</span></span>
<span><span class="co">## [13] "Saudi Arabian Oil Minister Hisham Nazer reiterated the kingdom's commitment"</span></span>
<span><span class="co">## [14] "Kuwait's oil minister said in a newspaper interview that there were no plan"</span></span>
<span><span class="co">## [15] "The port of Philadelphia was closed when a Cypriot oil tanker Seapride II r"</span></span>
<span><span class="co">## [16] "A study group said the United States should increase its strategic petroleu"</span></span>
<span><span class="co">## [17] "A study group said the United States should increase its strategic petroleu"</span></span>
<span><span class="co">## [18] "Unocal Corp's Union Oil Co said it lowered its posted prices for crude oil "</span></span>
<span><span class="co">## [19] "The New York Mercantile Exchange set April one for the debut of a new proce"</span></span>
<span><span class="co">## [20] "Argentine crude oil production was down 10.8 pct in January 1987 to 12.32 m"</span></span></code></pre>
</div>
<div class="section level2">
<h2 id="moving-ahead">Moving ahead<a class="anchor" aria-label="anchor" href="#moving-ahead"></a>
</h2>
<p>Writing C++ functions is obviously more demanding than writing R
code. But using CWB functionality as exposed by RcppCWB in C++ functions
that can be called from R may be a good solution to performance and
memory issues. Rcpp brings writing C++ code much closer to what R users
are acquainted with, making it much easier to write high-performance C++
code. So we encourage considering this option when pure R solutions
are not fast enough.</p>
</div>
</div>
<div class="col-md-3 hidden-xs hidden-sm" id="pkgdown-sidebar">
<nav id="toc" data-toggle="toc"><h2 data-toc-skip>Contents</h2>
</nav>
</div>
</div>
<footer><div class="copyright">
<p></p>
<p>Developed by Andreas Blaette, Bernard Desgraupes, Sylvain Loiseau.</p>
</div>
<div class="pkgdown">
<p></p>
<p>Site built with <a href="https://pkgdown.r-lib.org/" class="external-link">pkgdown</a> 2.0.6.</p>
</div>
</footer>
</div>
</body>
</html>