Package 'arrow': November 20, 2021
BugReports https://issues.apache.org/jira/projects/ARROW/issues
Encoding UTF-8
Language en-US
SystemRequirements C++11; for AWS S3 support on Linux, libcurl and
openssl (optional)
Biarch true
Imports assertthat, bit64 (>= 0.9-7), methods, purrr, R6, rlang,
stats, tidyselect, utils, vctrs
RoxygenNote 7.1.2
Config/testthat/edition 3
VignetteBuilder knitr
Suggests DBI, dbplyr, decor, distro, dplyr, duckdb (>= 0.2.8), hms,
knitr, lubridate, pkgload, reticulate, rmarkdown, stringi,
stringr, testthat, tibble, withr
Collate 'arrowExports.R' 'enums.R' 'arrow-package.R' 'type.R'
'array-data.R' 'arrow-datum.R' 'array.R' 'arrow-tabular.R'
'buffer.R' 'chunked-array.R' 'io.R' 'compression.R' 'scalar.R'
'compute.R' 'config.R' 'csv.R' 'dataset.R' 'dataset-factory.R'
'dataset-format.R' 'dataset-partition.R' 'dataset-scan.R'
'dataset-write.R' 'deprecated.R' 'dictionary.R'
R topics documented:
array
ArrayData
arrow_available
arrow_info
buffer
call_function
ChunkedArray
Codec
codec_is_available
compression
copy_files
cpu_count
create_package_with_all_dependencies
CsvReadOptions
CsvTableReader
data-type
Dataset
dataset_factory
DataType
dictionary
DictionaryType
Expression
FeatherReader
Field
FileFormat
FileInfo
FileSelector
FileSystem
FileWriteOptions
FixedWidthType
flight_connect
flight_get
flight_put
FragmentScanOptions
hive_partition
InputStream
install_arrow
install_pyarrow
io_thread_count
list_compute_functions
list_flights
load_flight_server
map_batches
match_arrow
Message
MessageReader
mmap_create
mmap_open
open_dataset
OutputStream
ParquetArrowReaderProperties
ParquetFileReader
ParquetFileWriter
ParquetWriterProperties
Partitioning
read_arrow
read_delim_arrow
read_feather
read_json_arrow
read_message
read_parquet
read_schema
RecordBatch
RecordBatchReader
RecordBatchWriter
s3_bucket
Scalar
Scanner
Schema
Table
to_arrow
to_duckdb
type
unify_schemas
value_counts
write_arrow
write_csv_arrow
write_dataset
write_feather
write_parquet
write_to_raw
Index
array
Description
An Array is an immutable data array with some logical type and some length. Most logical types
are contained in the base Array class; there are also subclasses for DictionaryArray, ListArray,
and StructArray.
Factory
The Array$create() factory method instantiates an Array and takes the following arguments:
• x: an R vector, list, or data.frame
• type: an optional data type for x. If omitted, the type will be inferred from the data.
Array$create() will return the appropriate subclass of Array, such as DictionaryArray when
given an R factor.
To compose a DictionaryArray directly, call DictionaryArray$create(), which takes two ar-
guments:
• x: an R vector or Array of integers for the dictionary indices
• dict: an R vector or Array of dictionary values (like R factor levels but not limited to strings
only)
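For example, a minimal sketch (the index and dictionary values below are purely illustrative; the indices are 0-based positions into dict):
ind <- Array$create(c(0L, 1L, 0L, 2L))
dict <- Array$create(c("a", "b", "c"))
DictionaryArray$create(ind, dict)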
Usage
a <- Array$create(x)
length(a)
print(a)
a == a
Methods
• $IsNull(i): Return true if value at index is null. Does not boundscheck
• $IsValid(i): Return true if value at index is valid. Does not boundscheck
• $length(): Size in the number of elements this array contains
• $offset: A relative position into another array’s data, to enable zero-copy slicing
• $null_count: The number of null entries in the array
• $type: logical type of data
• $type_id(): type id
• $Equals(other) : is this array equal to other
• $ApproxEquals(other) :
• $Diff(other) : return a string expressing the difference between two arrays
• $data(): return the underlying ArrayData
• $as_vector(): convert to an R vector
• $ToString(): string representation of the array
• $Slice(offset, length = NULL): Construct a zero-copy slice of the array with the indicated
offset and length. If length is NULL, the slice goes until the end of the array.
• $Take(i): return an Array with values at positions given by integers (R vector or Arrow Array)
i.
• $Filter(i, keep_na = TRUE): return an Array with values at positions where logical vector (or
Arrow boolean Array) i is TRUE.
• $SortIndices(descending = FALSE): return an Array of integer positions that can be used to
rearrange the Array in ascending or descending order
• $RangeEquals(other, start_idx, end_idx, other_start_idx) :
• $cast(target_type, safe = TRUE, options = cast_options(safe)): Alter the data in the array to
change its type.
• $View(type): Construct a zero-copy view of this array with the given type.
• $Validate() : Perform any validation checks to determine obvious inconsistencies within the
array’s internal data. This can be an expensive check, potentially O(length)
Examples
# create an Array with a missing value (setup assumed for this example)
na_array <- Array$create(c(1:5, NA))
# zero-copy slicing; the offset of the new Array will be the same as the index passed to $Slice
new_array <- na_array$Slice(5)
new_array$offset
# Compare 2 arrays
na_array2 <- na_array
na_array2 == na_array # element-wise comparison
na_array2$Equals(na_array) # overall comparison
ArrayData
Description
The ArrayData class allows you to get and inspect the data inside an arrow::Array.
Usage
data <- Array$create(x)$data()
data$type
data$length
data$null_count
data$offset
data$buffers
Methods
...
arrow_available
Description
You won't generally need to call these functions, but they're made available for diagnostic purposes.
Usage
arrow_available()
arrow_with_dataset()
arrow_with_parquet()
arrow_with_s3()
arrow_with_json()
Value
Logical: TRUE or FALSE, depending on whether the corresponding feature was enabled when the package was installed.
See Also
If any of these are FALSE, see vignette("install",package = "arrow") for guidance on rein-
stalling the package.
Examples
arrow_available()
arrow_with_dataset()
arrow_with_parquet()
arrow_with_json()
arrow_with_s3()
arrow_info
Description
This function summarizes a number of build-time configurations and run-time settings for the Arrow
package. It may be useful for diagnostics.
Usage
arrow_info()
Value
A list including version information, boolean "capabilities", statistics from Arrow's memory
allocator, and Arrow's run-time information.
8 buffer
Description
A Buffer is an object containing a pointer to a piece of contiguous memory with a particular size.
Usage
buffer(x)
Arguments
x R object. Only raw, numeric and integer vectors are currently supported
Value
A Buffer.
Factory
Methods
• $is_mutable : is this buffer mutable?
• $ZeroPadding() : zero bytes in padding, i.e. bytes between size and capacity
• $size : size in memory, in bytes
• $capacity: possible capacity, in bytes
Examples
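A minimal sketch of creating a Buffer from a raw vector and inspecting it (the values are illustrative):
vec <- as.raw(c(1, 2, 3, 255))
buf <- buffer(vec)
buf$size
buf$is_mutable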
call_function
Description
This function provides a lower-level API for calling Arrow functions by their string function name.
You won’t use it directly for most applications. Many Arrow compute functions are mapped to R
methods, and in a dplyr evaluation context, all Arrow functions are callable with an arrow_ prefix.
Usage
call_function(
function_name,
...,
args = list(...),
options = empty_named_list()
)
Arguments
function_name string Arrow compute function name
... Function arguments, which may include Array, ChunkedArray, Scalar, RecordBatch,
or Table.
args list arguments as an alternative to specifying in ...
options named list of C++ function options.
Details
When passing indices in ..., args, or options, express them as 0-based integers (consistent with
C++).
Value
An Array, ChunkedArray, Scalar, RecordBatch, or Table, whatever the compute function results
in.
See Also
Arrow C++ documentation for the functions and their respective options.
Examples
a <- Array$create(rnorm(10000))
call_function("quantile", a, options = list(q = seq(0, 1, 0.25)))
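The same call can also be made by supplying the arguments as a list via args; a small sketch, reusing a from above with a second assumed array b:
b <- Array$create(rnorm(10000))
call_function("add", args = list(a, b))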
ChunkedArray
Description
A ChunkedArray is a data structure managing a list of primitive Arrow Arrays logically as one large
array. Chunked arrays may be grouped together in a Table.
Usage
chunked_array(..., type = NULL)
Arguments
... Vectors to coerce
type currently ignored
Factory
The ChunkedArray$create() factory method instantiates the object from various Arrays or R
vectors. chunked_array() is an alias for it.
Methods
• $length(): Size in the number of elements this array contains
• $chunk(i): Extract an Array chunk by integer position
• $as_vector(): convert to an R vector
• $Slice(offset, length = NULL): Construct a zero-copy slice of the array with the indicated
offset and length. If length is NULL, the slice goes until the end of the array.
• $Take(i): return a ChunkedArray with values at positions given by integers i. If i is an Arrow
Array or ChunkedArray, it will be coerced to an R vector before taking.
• $Filter(i, keep_na = TRUE): return a ChunkedArray with values at positions where logical
vector or Arrow boolean-type (Chunked)Array i is TRUE.
• $SortIndices(descending = FALSE): return an Array of integer positions that can be used to
rearrange the ChunkedArray in ascending or descending order
• $cast(target_type, safe = TRUE, options = cast_options(safe)): Alter the data in the array to
change its type.
• $null_count: The number of null entries in the array
• $chunks: return a list of Arrays
• $num_chunks: integer number of chunks in the ChunkedArray
• $type: logical type of data
• $View(type): Construct a zero-copy view of this ChunkedArray with the given type.
• $Validate(): Perform any validation checks to determine obvious inconsistencies within the
array’s internal data. This can be an expensive check, potentially O(length)
See Also
Array
Examples
# create a ChunkedArray with three chunks (setup assumed for this example)
class_scores <- chunked_array(c(87, 88, 89), c(94, 93, 92), c(71, 72, 73))
# You can combine Take and SortIndices to return a ChunkedArray with 1 chunk
# containing all values, ordered.
class_scores$Take(class_scores$SortIndices(descending = TRUE))
Codec
Description
Codecs allow you to create compressed input and output streams.
Factory
The Codec$create() factory method takes the following arguments:
• type: string name of the compression method. Possible values are "uncompressed", "snappy",
"gzip", "brotli", "zstd", "lz4", "lzo", or "bz2". type may be upper- or lower-cased. Not
all methods may be available; support depends on build-time flags for the C++ library. See
codec_is_available(). Most builds support at least "snappy" and "gzip". All support "un-
compressed".
• compression_level: compression level, the default value (NA) uses the default compression
level for the selected compression type.
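For example, assuming gzip support was enabled at build time (see codec_is_available()):
if (codec_is_available("gzip")) {
  gzip_codec <- Codec$create("gzip")
}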
codec_is_available
Description
Support for compression libraries depends on the build-time settings of the Arrow C++ library. This
function lets you know which are available for use.
Usage
codec_is_available(type)
Arguments
type A string, one of "uncompressed", "snappy", "gzip", "brotli", "zstd", "lz4", "lzo",
or "bz2", case insensitive.
Value
Logical: is type available?
Examples
codec_is_available("gzip")
compression
Description
CompressedInputStream and CompressedOutputStream allow you to apply a compression Codec
to an input or output stream.
Factory
The CompressedInputStream$create() and CompressedOutputStream$create() factory meth-
ods instantiate the object and take the following arguments:
• stream An InputStream or OutputStream, respectively
• codec A Codec, either a Codec instance or a string
• compression_level compression level for when the codec argument is given as a string
Methods
Methods are inherited from InputStream and OutputStream, respectively
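A minimal sketch of writing through a gzip-compressed output stream (assumes gzip support in your build; the file path is temporary and write_csv_arrow() is used only to produce some bytes):
tf <- tempfile(fileext = ".csv.gz")
dest <- CompressedOutputStream$create(FileOutputStream$create(tf), codec = "gzip")
write_csv_arrow(mtcars, dest)
dest$close()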
copy_files
Description
Copy files between FileSystems
Usage
copy_files(from, to, chunk_size = 1024L * 1024L)
Arguments
from A string path to a local directory or file, a URI, or a SubTreeFileSystem. Files
will be copied recursively from this path.
to A string path to a local directory or file, a URI, or a SubTreeFileSystem. Di-
rectories will be created as necessary
chunk_size The maximum size of block to read before flushing to the destination file. A
larger chunk_size will use more memory while copying but may help accom-
modate high latency FileSystems.
Value
Nothing: called for side effects in the file system
Examples
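A minimal local-to-local sketch (both directories are temporary and created just for the example):
td1 <- tempfile()
dir.create(td1)
td2 <- tempfile()
write.csv(mtcars, file.path(td1, "mtcars.csv"))
copy_files(td1, td2)
list.files(td2)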
cpu_count
Description
Manage the global CPU thread pool in libarrow
Usage
cpu_count()
set_cpu_count(num_threads)
Arguments
num_threads integer: New number of threads for thread pool
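For example, to check the current thread pool size (changing it is shown commented out so the example does not alter global state):
cpu_count()
# set_cpu_count(2)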
create_package_with_all_dependencies
Create a source bundle that includes all thirdparty dependencies
Description
Create a source bundle that includes all thirdparty dependencies
Usage
create_package_with_all_dependencies(dest_file = NULL, source_file = NULL)
Arguments
dest_file File path for the new tar.gz package. Defaults to arrow_V.V.V_with_deps.tar.gz
in the current directory (V.V.V is the version)
source_file File path for the input tar.gz package. Defaults to downloading the package from
CRAN (or whatever you have set as the first in getOption("repos"))
Value
The full path to dest_file, invisibly
Details
This function is used for setting up an offline build. If it's possible to download at build time, don't
use this function. Instead, let cmake download the required dependencies for you. These down-
loaded dependencies are only used in the build if ARROW_DEPENDENCY_SOURCE is unset, BUNDLED,
or AUTO. https://arrow.apache.org/docs/developers/cpp/building.html#offline-builds
If you’re using binary packages you shouldn’t need to use this function. You should download the
appropriate binary from your package repository, transfer that to the offline computer, and install
that. Any OS can create the source bundle, but it cannot be installed on Windows. (Instead, use a
standard Windows binary package.)
Note if you’re using RStudio Package Manager on Linux: If you still want to make a source bundle
with this function, make sure to set the first repo in options("repos") to be a mirror that contains
source packages (that is: something other than the RSPM binary mirror URLs).
Steps for an offline install with optional dependencies::
Using a computer with internet access, pre-download the dependencies::
• Install the arrow package or run source("https://raw.githubusercontent.com/apache/arrow/master/r/R/in
• Run create_package_with_all_dependencies("my_arrow_pkg.tar.gz")
• Copy the newly created my_arrow_pkg.tar.gz to the computer without internet access
On the computer without internet access, install the prepared package::
• Install the arrow package from the copied file
– install.packages("my_arrow_pkg.tar.gz",dependencies = c("Depends","Imports","LinkingTo"))
– This installation will build from source, so cmake must be available
• Run arrow_info() to check installed capabilities
Examples
## Not run:
new_pkg <- create_package_with_all_dependencies()
# Note: this works when run in the same R session, but it's meant to be
# copied to a different computer.
install.packages(new_pkg, dependencies = c("Depends", "Imports", "LinkingTo"))
## End(Not run)
CsvReadOptions
Description
CsvReadOptions, CsvParseOptions, CsvConvertOptions, JsonReadOptions, JsonParseOptions,
and TimestampParser are containers for various file reading options. See their usage in read_csv_arrow()
and read_json_arrow(), respectively.
Factory
The CsvReadOptions$create() and JsonReadOptions$create() factory methods take the fol-
lowing arguments:
Active bindings
• column_names: from CsvReadOptions
CsvTableReader
Description
CsvTableReader and JsonTableReader wrap the Arrow C++ CSV and JSON table readers. See
their usage in read_csv_arrow() and read_json_arrow(), respectively.
Factory
The CsvTableReader$create() and JsonTableReader$create() factory methods take the fol-
lowing arguments:
• file An Arrow InputStream
• convert_options (CSV only), parse_options, read_options: see CsvReadOptions
• ... additional parameters.
Methods
• $Read(): returns an Arrow Table.
data-type
Description
These functions create type objects corresponding to Arrow types. Use them when defining a
schema() or as inputs to other types, like struct. Most of these functions don’t take arguments,
but a few do.
Usage
int8()
int16()
int32()
int64()
uint8()
uint16()
uint32()
uint64()
float16()
halffloat()
float32()
float()
float64()
boolean()
bool()
utf8()
large_utf8()
binary()
large_binary()
fixed_size_binary(byte_width)
string()
date32()
date64()
null()
timestamp(unit, timezone)
time32(unit)
time64(unit)
decimal(precision, scale)
struct(...)
list_of(type)
large_list_of(type)
fixed_size_list_of(type, list_size)
Arguments
byte_width byte width for FixedSizeBinary type.
unit For time/timestamp types, the time unit. time32() can take either "s" or "ms",
while time64() can be "us" or "ns". timestamp() can take any of those four
values.
timezone For timestamp(), an optional time zone string.
precision For decimal(), precision
scale For decimal(), scale
... For struct(), a named list of types to define the struct columns
type For list_of(), a data type to make a list-of-type
list_size list size for FixedSizeList type.
Details
A few functions have aliases: bool() and boolean(), float() and float32(), halffloat() and float16(),
and string() and utf8() each create the same type.
date32() creates a datetime type with a "day" unit, like the R Date class. date64() has a "ms"
unit.
uint32 (32 bit unsigned integer), uint64 (64 bit unsigned integer), and int64 (64-bit signed in-
teger) types may contain values that exceed the range of R’s integer type (32-bit signed inte-
ger). When these arrow objects are translated to R objects, uint32 and uint64 are converted to
double ("numeric") and int64 is converted to bit64::integer64. For int64 types, this con-
version can be disabled (so that int64 always yields a bit64::integer64 object) by setting
options(arrow.int64_downcast = FALSE).
Value
An Arrow type object inheriting from DataType.
See Also
dictionary() for creating a dictionary (factor-like) type.
Examples
bool()
struct(a = int32(), b = double())
timestamp("ms", timezone = "CEST")
time64("ns")
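A few more illustrative type constructions using the functions listed above:
fixed_size_binary(4)
list_of(int32())
struct(id = int64(), scores = list_of(float64()))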
Dataset
Description
Arrow Datasets allow you to query against data that has been split across multiple files. This shard-
ing of data may indicate partitioning, which can accelerate queries that only touch some partitions
(files).
A Dataset contains one or more Fragments, such as files, of potentially differing type and parti-
tioning.
For Dataset$create(), see open_dataset(), which is an alias for it.
DatasetFactory is used to provide finer control over the creation of Datasets.
Factory
DatasetFactory is used to create a Dataset, inspect the Schema of the fragments contained in
it, and declare a partitioning. FileSystemDatasetFactory is a subclass of DatasetFactory for
discovering files in the local file system, the only currently supported file system.
For the DatasetFactory$create() factory method, see dataset_factory(), an alias for it. A
DatasetFactory has:
• $Inspect(unify_schemas): If unify_schemas is TRUE, all fragments will be scanned and a
unified Schema will be created from them; if FALSE (default), only the first fragment will be
inspected for its schema. Use this fast path when you know and trust that all fragments have
an identical schema.
• $Finish(schema, unify_schemas): Returns a Dataset. If schema is provided, it will be used
for the Dataset; if omitted, a Schema will be created from inspecting the fragments (files) in
the dataset, following unify_schemas as described above.
• filesystem: A FileSystem
• selector: Either a FileSelector or NULL
• paths: Either a character vector of file paths or NULL
• format: A FileFormat
• partitioning: Either Partitioning, PartitioningFactory, or NULL
Methods
A Dataset has the following methods:
• $NewScan(): Returns a ScannerBuilder for building a query
• $schema: Active binding that returns the Schema of the Dataset; you may also replace the
dataset’s schema by using ds$schema <-new_schema. This method currently supports only
adding, removing, or reordering fields in the schema: you cannot alter or cast the field types.
See Also
open_dataset() for a simple interface to creating a Dataset
dataset_factory
Description
A Dataset can constructed using one or more DatasetFactorys. This function helps you construct a
DatasetFactory that you can pass to open_dataset().
Usage
dataset_factory(
x,
filesystem = NULL,
format = c("parquet", "arrow", "ipc", "feather", "csv", "tsv", "text"),
partitioning = NULL,
...
)
Arguments
x A string path to a directory containing data files, a vector of one or more
string paths to data files, or a list of DatasetFactory objects whose datasets
should be combined. If this argument is specified it will be used to construct a
UnionDatasetFactory and other arguments will be ignored.
filesystem A FileSystem object; if omitted, the FileSystem will be detected from x
format A FileFormat object, or a string identifier of the format of the files in x. Currently
supported values:
• "parquet"
• "ipc"/"arrow"/"feather", all aliases for each other; for Feather, note that only
version 2 files are supported
• "csv"/"text", aliases for the same thing (because comma is the default de-
limiter for text files
• "tsv", equivalent to passing format = "text", delimiter = "\t"
Default is "parquet", unless a delimiter is also specified, in which case it is
assumed to be "text".
partitioning One of
• A Schema, in which case the file paths relative to sources will be parsed,
and path segments will be matched with the schema fields. For example,
schema(year = int16(),month = int8()) would create partitions for file
paths like "2019/01/file.parquet", "2019/02/file.parquet", etc.
• A character vector that defines the field names corresponding to those path
segments (that is, you’re providing the names that would correspond to a
Schema but the types will be autodetected)
• A HivePartitioning or HivePartitioningFactory, as returned by hive_partition()
which parses explicit or autodetected fields from Hive-style path segments
• NULL for no partitioning
... Additional format-specific options, passed to FileFormat$create(). For CSV
options, note that you can specify them either with the Arrow C++ library nam-
ing ("delimiter", "quoting", etc.) or the readr-style naming used in read_csv_arrow()
("delim", "quote", etc.). Not all readr options are currently supported; please
file an issue if you encounter one that arrow should support.
Details
If you have only a single DatasetFactory (for example, you have a single directory con-
taining Parquet files), you can call open_dataset() directly. Use dataset_factory() when you
want to combine different directories, file systems, or file formats.
Value
A DatasetFactory object. Pass this to open_dataset(), in a list potentially with other DatasetFactory
objects, to create a Dataset.
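A minimal sketch of combining two directories of Parquet files (the paths are illustrative):
## Not run:
factories <- list(
  dataset_factory("data/2019", format = "parquet"),
  dataset_factory("data/2020", format = "parquet")
)
ds <- open_dataset(factories)
## End(Not run)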
DataType
Description
class arrow::DataType
Methods
TODO
dictionary
Description
Create a dictionary (factor-like) type.
Usage
Arguments
Value
A DictionaryType
See Also
DictionaryType
Description
class DictionaryType
Methods
TODO
Expression
Description
Expressions are used to define filter logic for passing to a Dataset Scanner.
Expression$scalar(x) constructs an Expression which always evaluates to the provided scalar
(length-1) R value.
Expression$field_ref(name) is used to construct an Expression which evaluates to the named
column in the Dataset against which it is evaluated.
Expression$create(function_name,...,options) builds a function-call Expression contain-
ing one or more Expressions.
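For example, a sketch of a filter Expression that compares an assumed field "year" against a scalar:
expr <- Expression$create("greater", Expression$field_ref("year"), Expression$scalar(2018L))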
FeatherReader
Description
This class enables you to interact with Feather files. Create one to connect to a file or other Input-
Stream, and call Read() on it to make an arrow::Table. See its usage in read_feather().
Factory
The FeatherReader$create() factory method instantiates the object and takes the following ar-
gument:
• file A file or other Arrow InputStream to read from
Methods
• $Read(columns): Returns a Table of the selected columns, a vector of integer indices
• $column_names: Active binding, returns the column names in the Feather file
• $schema: Active binding, returns the schema of the Feather file
• $version: Active binding, returns 1 or 2, according to the Feather file version
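A minimal sketch (the Feather file is written just for the example; ReadableFile is assumed as the input stream):
tf <- tempfile()
write_feather(mtcars, tf)
reader <- FeatherReader$create(ReadableFile$create(tf))
reader$column_names
reader$version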
Field
Description
field() lets you create an arrow::Field that maps a DataType to a column name. Fields are
contained in Schemas.
Usage
Arguments
Methods
Examples
field("x", int32())
FileFormat
Description
A FileFormat holds information about how to read and parse the files included in a Dataset. There
are subclasses corresponding to the supported file formats (ParquetFileFormat and IpcFileFormat).
Factory
Examples
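For example, a sketch of creating formats to pass to open_dataset() or dataset_factory() (the delimiter value is illustrative):
parquet_fmt <- FileFormat$create("parquet")
csv_fmt <- FileFormat$create("csv", delimiter = ";")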
FileInfo
Description
A FileInfo describes a single file system entry (a file or directory), as returned by FileSystem$GetFileInfo().
Methods
• base_name() : The file base name (component after the last directory separator).
• extension() : The file extension
Active bindings
• $type: The file type
• $path: The full file path in the filesystem
• $size: The size in bytes, if available. Only regular files are guaranteed to have a size.
• $mtime: The time of last modification, if available.
FileSelector
Description
file selector
Factory
The $create() factory method instantiates a FileSelector given the 3 fields described below.
Fields
• base_dir: The directory in which to select files. If the path exists but doesn’t point to a
directory, this should be an error.
• allow_not_found: The behavior if base_dir doesn’t exist in the filesystem. If FALSE, an
error is returned. If TRUE, an empty selection is returned
• recursive: Whether to recurse into subdirectories.
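For example, selecting everything under a temporary directory, recursively:
sel <- FileSelector$create(tempdir(), recursive = TRUE)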
FileSystem
Description
FileSystem is an abstract file system API. LocalFileSystem and S3FileSystem are subclasses;
SubTreeFileSystem provides a view of another FileSystem rooted at a given path.
Factory
LocalFileSystem$create() takes no arguments. S3FileSystem$create() optionally takes the following arguments:
• anonymous: logical, default FALSE. If true, will not attempt to look up credentials using stan-
dard AWS configuration methods.
• access_key, secret_key: authentication credentials. If one is provided, the other must be as
well. If both are provided, they will override any AWS configuration set at the environment
level.
• session_token: optional string for authentication along with access_key and secret_key
• role_arn: string AWS ARN of an AccessRole. If provided instead of access_key and
secret_key, temporary credentials will be fetched by assuming this role.
• session_name: optional string identifier for the assumed role session.
• external_id: optional unique string identifier that might be required when you assume a role
in another account.
• load_frequency: integer, frequency (in seconds) with which temporary credentials from an
assumed role session will be refreshed. Default is 900 (i.e. 15 minutes)
• region: AWS region to connect to. If omitted, the AWS library will provide a sensible default
based on client configuration, falling back to "us-east-1" if no other alternatives are found.
• endpoint_override: If non-empty, override region with a connect string such as "local-
host:9000". This is useful for connecting to file systems that emulate S3.
• scheme: S3 connection transport (default "https")
• background_writes: logical, whether OutputStream writes will be issued in the back-
ground, without blocking (default TRUE)
Methods
• $GetFileInfo(x): x may be a FileSelector or a character vector of paths. Returns a list of
FileInfo
• $CreateDir(path, recursive = TRUE): Create a directory and subdirectories.
• $DeleteDir(path): Delete a directory and its contents, recursively.
• $DeleteDirContents(path): Delete a directory’s contents, recursively. Like $DeleteDir(), but
doesn’t delete the directory itself. Passing an empty path ("") will wipe the entire filesystem
tree.
• $DeleteFile(path) : Delete a file.
• $DeleteFiles(paths) : Delete many files. The default implementation issues individual delete
operations in sequence.
• $Move(src, dest): Move / rename a file or directory. If the destination exists: if it is a non-
empty directory, an error is returned; otherwise, if it has the same type as the source, it is
replaced; otherwise, behavior is unspecified (implementation-dependent).
• $CopyFile(src, dest): Copy a file. If the destination exists and is a directory, an error is
returned. Otherwise, it is replaced.
• $OpenInputStream(path): Open an input stream for sequential reading.
• $OpenInputFile(path): Open an input file for random access reading.
• $OpenOutputStream(path): Open an output stream for sequential writing.
• $OpenAppendStream(path): Open an output stream for appending.
Active bindings
• $type_name: string filesystem type name, such as "local", "s3", etc.
• $region: string AWS region, for S3FileSystem and SubTreeFileSystem containing a S3FileSystem
• $base_fs: for SubTreeFileSystem, the FileSystem it contains
• $base_path: for SubTreeFileSystem, the path in $base_fs which is considered root in this
SubTreeFileSystem.
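A minimal sketch of listing the entries under a local directory:
fs <- LocalFileSystem$create()
fs$GetFileInfo(FileSelector$create(tempdir()))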
FileWriteOptions
Description
A FileWriteOptions holds write options specific to a FileFormat.
FixedWidthType
Description
class arrow::FixedWidthType
Methods
TODO
flight_connect
Description
Connect to a Flight server
Usage
flight_connect(host = "localhost", port, scheme = "grpc+tcp")
Arguments
host string hostname to connect to
port integer port to connect on
scheme URL scheme, default is "grpc+tcp"
Value
A pyarrow.flight.FlightClient.
flight_get
Description
Get data from a Flight server
Usage
flight_get(client, path)
Arguments
client pyarrow.flight.FlightClient, as returned by flight_connect()
path string identifier under which data is stored
Value
A Table
flight_put
Description
Send data to a Flight server
Usage
flight_put(client, data, path, overwrite = TRUE)
Arguments
client pyarrow.flight.FlightClient, as returned by flight_connect()
data data.frame, RecordBatch, or Table to upload
path string identifier to store the data under
overwrite logical: if path exists on client already, should we replace it with the contents
of data? Default is TRUE; if FALSE and path exists, the function will error.
Value
client, invisibly.
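A minimal round-trip sketch, assuming a Flight server is listening locally (the port and path are illustrative):
## Not run:
client <- flight_connect(port = 8089)
flight_put(client, iris, path = "test_data/iris")
tab <- flight_get(client, "test_data/iris")
## End(Not run)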
FragmentScanOptions
Description
A FragmentScanOptions holds options specific to a FileFormat and a scan operation.
Factory
FragmentScanOptions$create() takes the following arguments:
• format: string identifier of the file format the options apply to
• For format = "parquet":
– pre_buffer: Pre-buffer the raw Parquet data. This can improve performance on high-
latency filesystems. Disabled by default.
• For format = "text": see CsvConvertOptions. Note that options can only be specified with
the Arrow C++ library naming. Also, "block_size" from CsvReadOptions may be given.
hive_partition
Description
Hive partitioning embeds field names and values in path segments, such as "/year=2019/month=2/data.parquet".
Usage
Arguments
Details
Because fields are named in the path segments, order of fields passed to hive_partition() does
not matter.
Value
A HivePartitioning (when field types are supplied), or a HivePartitioningFactory when called with
no arguments.
Examples
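For example, declaring the field types for Hive-style path segments like "/year=2019/month=2":
hive_partition(year = int16(), month = int8())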
InputStream
Description
RandomAccessFile inherits from InputStream and is a base class for: ReadableFile for reading
from a file; MemoryMappedFile for the same but with memory mapping; and BufferReader for
reading from a buffer. Use these with the various table readers.
Factory
The $create() factory methods instantiate the InputStream object and take the following arguments,
depending on the subclass:
Methods
• $GetSize():
• $supports_zero_copy(): Logical
• $seek(position): go to that position in the stream
• $tell(): return the position in the stream
• $close(): close the stream
• $Read(nbytes): read data from the stream, either a specified nbytes or all, if nbytes is not
provided
• $ReadAt(position, nbytes): similar to $seek(position)$Read(nbytes)
• $Resize(size): for a MemoryMappedFile that is writeable
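A minimal sketch of reading raw bytes from a file (the file is created just for the example):
tf <- tempfile()
writeBin(as.raw(1:100), tf)
f <- ReadableFile$create(tf)
buf <- f$Read(10)
f$close()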
install_arrow
Description
Use this function to install the latest release of arrow, to switch to or from a nightly development
version, or on Linux to try reinstalling with all necessary C++ dependencies.
Usage
install_arrow(
nightly = FALSE,
binary = Sys.getenv("LIBARROW_BINARY", TRUE),
use_system = Sys.getenv("ARROW_USE_PKG_CONFIG", FALSE),
minimal = Sys.getenv("LIBARROW_MINIMAL", FALSE),
verbose = Sys.getenv("ARROW_R_DEV", FALSE),
repos = getOption("repos"),
...
)
Arguments
nightly logical: Should we install a development version of the package, or should we
install from CRAN (the default)?
binary On Linux, value to set for the environment variable LIBARROW_BINARY, which
governs how C++ binaries are used, if at all. The default value, TRUE, tells the
installation script to detect the Linux distribution and version and find an appro-
priate C++ library. FALSE would tell the script not to retrieve a binary and instead
build Arrow C++ from source. Other valid values are strings corresponding to
a Linux distribution-version, to override the value that would be detected. See
vignette("install",package = "arrow") for further details.
use_system logical: Should we use pkg-config to look for Arrow system packages? De-
fault is FALSE. If TRUE, source installation may be faster, but there is a risk of
version mismatch. This sets the ARROW_USE_PKG_CONFIG environment variable.
minimal logical: If building from source, should we build without optional dependencies
(compression libraries, for example)? Default is FALSE. This sets the LIBARROW_MINIMAL
environment variable.
verbose logical: Print more debugging output when installing? Default is FALSE. This
sets the ARROW_R_DEV environment variable.
repos character vector of base URLs of the repositories to install from (passed to
install.packages())
... Additional arguments passed to install.packages()
Details
Note that, unlike packages like tensorflow, blogdown, and others that require external dependen-
cies, you do not need to run install_arrow() after a successful arrow installation.
See Also
arrow_available() to see if the package was configured with necessary C++ dependencies. vignette("install",package
= "arrow") for more ways to tune installation on Linux.
install_pyarrow
Description
pyarrow is the Python package for Apache Arrow. This function helps with installing it for use
with reticulate.
Usage
Arguments
envname The name or full path of the Python environment to install into. This can be a vir-
tualenv or conda environment created by reticulate. See reticulate::py_install().
nightly logical: Should we install a development version of the package? Default is to
use the official release version.
... additional arguments passed to reticulate::py_install().
io_thread_count
Description
Manage the global I/O thread pool in libarrow
Usage
io_thread_count()
set_io_thread_count(num_threads)
Arguments
num_threads integer: New number of threads for thread pool
list_compute_functions
List available Arrow C++ compute functions
Description
This function lists the names of all available Arrow C++ library compute functions. These can be
called by passing to call_function(), or they can be called by name with an arrow_ prefix inside
a dplyr verb.
Usage
Arguments
Details
The resulting list describes the capabilities of your arrow build. Some functions, such as string and
regular expression functions, require optional build-time C++ dependencies. If your arrow package
was not compiled with those features enabled, those functions will not appear in this list.
Some functions take options that need to be passed when calling them (in a list called options).
These options require custom handling in C++; many functions already have that handling set up
but not all do. If you encounter one that needs special handling for options, please report an issue.
Note that this list does not enumerate all of the R bindings for these functions. The package includes
Arrow methods for many base R functions that can be called directly on Arrow objects, as well as
some tidyverse-flavored versions available inside dplyr verbs.
Value
A character vector of available Arrow C++ compute function names.
Examples
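For example:
available_funcs <- list_compute_functions()
head(available_funcs)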
list_flights
Description
See available resources on a Flight server
Usage
list_flights(client)
flight_path_exists(client, path)
Arguments
client pyarrow.flight.FlightClient, as returned by flight_connect()
path string identifier under which data is stored
Value
list_flights() returns a character vector of paths. flight_path_exists() returns a logical
value, the equivalent of path %in% list_flights()
load_flight_server
Description
Load a Python Flight server
Usage
load_flight_server(name, path = system.file(package = "arrow"))
Arguments
name string Python module name
path file system path where the Python module is found. Default is to look in the inst/
directory for included modules.
Examples
load_flight_server("demo_flight_server")
map_batches
Description
As an alternative to calling collect() on a Dataset query, you can use this function to access
the stream of RecordBatches in the Dataset. This lets you aggregate on each chunk and pull the
intermediate results into a data.frame for further aggregation, even if you couldn’t fit the whole
Dataset result in memory.
Usage
map_batches(X, FUN, ..., .data.frame = TRUE)
Arguments
X A Dataset or arrow_dplyr_query object, as returned by the dplyr methods
on Dataset.
FUN A function or purrr-style lambda expression to apply to each batch
... Additional arguments passed to FUN
.data.frame logical: collect the resulting chunks into a single data.frame? Default TRUE
Details
This is experimental and not recommended for production use.
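A minimal sketch, assuming ds is a Dataset (for example, one returned by open_dataset()):
## Not run:
map_batches(ds, function(batch) data.frame(n_rows = nrow(batch)))
## End(Not run)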
match_arrow
Description
base::match() is not a generic, so we can’t just define Arrow methods for it. This function exposes
the analogous functions in the Arrow C++ library.
Usage
match_arrow(x, table, ...)
is_in(x, table, ...)
Arguments
x Scalar, Array or ChunkedArray
table Scalar, Array, ChunkedArray, or R vector lookup table.
... additional arguments, ignored
Value
match_arrow() returns an int32-type Arrow object of the same length and type as x with the (0-
based) indexes into table. is_in() returns a boolean-type Arrow object of the same length and
type as x, with values indicating, for each element of x, whether it is present in table.
Examples
# Although there are multiple matches, you are returned the index of the first
# match, as with the base R equivalent
match(4, mtcars$cyl) # 1-indexed
match_arrow(Scalar$create(4), cars_tbl$cyl) # 0-indexed
# If `x` contains multiple values, you are returned the indices of the first
# match for each value.
match(c(4, 6, 8), mtcars$cyl)
match_arrow(Array$create(c(4, 6, 8)), cars_tbl$cyl)
Message
Description
class arrow::Message
Methods
TODO
MessageReader
Description
class arrow::MessageReader
Methods
TODO
mmap_create
Description
Create a new read/write memory mapped file of a given size
Usage
mmap_create(path, size)
Arguments
path file path
size size in bytes
Value
an arrow::io::MemoryMappedFile
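For example, creating a small memory-mapped file in a temporary location:
tf <- tempfile()
mm <- mmap_create(tf, 1024)
mm$close()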
mmap_open
Description
Open a memory mapped file
Usage
mmap_open(path, mode = c("read", "write", "readwrite"))
Arguments
path file path
mode file mode (read/write/readwrite)
open_dataset
Description
Arrow Datasets allow you to query against data that has been split across multiple files. This shard-
ing of data may indicate partitioning, which can accelerate queries that only touch some partitions
(files). Call open_dataset() to point to a directory of data files and return a Dataset, then use
dplyr methods to query it.
Usage
open_dataset(
sources,
schema = NULL,
partitioning = hive_partition(),
unify_schemas = NULL,
format = c("parquet", "arrow", "ipc", "feather", "csv", "tsv", "text"),
...
)
Arguments
sources One of:
• a string path or URI to a directory containing data files
• a string path or URI to a single file
• a character vector of paths or URIs to individual data files
• a list of Dataset objects as created by this function
• a list of DatasetFactory objects as created by dataset_factory().
When sources is a vector of file URIs, they must all use the same protocol and
point to files located in the same file system and having the same format.
schema Schema for the Dataset. If NULL (the default), the schema will be inferred from
the data sources.
partitioning When sources is a directory path/URI, one of:
• a Schema, in which case the file paths relative to sources will be parsed,
and path segments will be matched with the schema fields. For example,
schema(year = int16(),month = int8()) would create partitions for file
paths like "2019/01/file.parquet", "2019/02/file.parquet", etc.
• a character vector that defines the field names corresponding to those path
segments (that is, you’re providing the names that would correspond to a
Schema but the types will be autodetected)
• a HivePartitioning or HivePartitioningFactory, as returned by hive_partition()
which parses explicit or autodetected fields from Hive-style path segments
• NULL for no partitioning
Value
A Dataset R6 object. Use dplyr methods on it to query the data, or call $NewScan() to construct a
query directly.
See Also
vignette("dataset",package = "arrow")
Examples
# Set up a directory of data files for the example (setup assumed)
tf <- tempfile()
dir.create(tf)
on.exit(unlink(tf))
data <- mtcars
write_dataset(data, tf)
# You can specify a directory containing the files for your dataset and
# open_dataset will scan all files in your directory.
open_dataset(tf)
## You must specify the file format if using a format other than parquet.
tf2 <- tempfile()
dir.create(tf2)
on.exit(unlink(tf2))
write_dataset(data, tf2, format = "ipc")
# This line will result in errors when you try to work with the data
## Not run:
open_dataset(tf2)
## End(Not run)
# This line will work
open_dataset(tf2, format = "ipc")
# Write a dataset partitioned by Month and Day values (setup assumed)
tf3 <- tempfile()
dir.create(tf3)
on.exit(unlink(tf3))
write_dataset(airquality, tf3, partitioning = c("Month", "Day"), hive_style = FALSE)
# View files - you can see the partitioning means that files have been written
# to folders based on Month/Day values
tf3_files <- list.files(tf3, recursive = TRUE)
# With no partitioning specified, dataset contains all files but doesn't include
# directory names as field names
open_dataset(tf3)
# Now that partitioning has been specified, your dataset contains columns for Month and Day
open_dataset(tf3, partitioning = c("Month", "Day"))
# If you want to specify the data types for your fields, you can pass in a Schema
open_dataset(tf3, partitioning = schema(Month = int8(), Day = int8()))
OutputStream
Description
FileOutputStream is for writing to a file; BufferOutputStream writes to a buffer; You can create
one and pass it to any of the table writers, for example.
Factory
The $create() factory methods instantiate the OutputStream object and take the following argu-
ments, depending on the subclass:
Methods
• $tell(): return the position in the stream
• $close(): close the stream
• $write(x): send x to the stream
• $capacity(): for BufferOutputStream
• $finish(): for BufferOutputStream
• $GetExtentBytesWritten(): for MockOutputStream, report how many bytes were sent.
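A minimal sketch of passing a FileOutputStream to a table writer (the file path is temporary):
tf <- tempfile()
sink <- FileOutputStream$create(tf)
write_feather(mtcars, sink)
sink$close()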
ParquetArrowReaderProperties
ParquetArrowReaderProperties class
Description
This class holds settings to control how a Parquet file is read by ParquetFileReader.
Factory
Methods
• $read_dictionary(column_index)
• $set_read_dictionary(column_index, read_dict)
• $use_threads(use_threads)
ParquetFileReader
Description
This class enables you to interact with Parquet files.
Factory
The ParquetFileReader$create() factory method instantiates the object and takes the following
arguments:
• file A character file name, raw vector, or Arrow file connection object (e.g. RandomAccessFile).
• props Optional ParquetArrowReaderProperties
• mmap Logical: whether to memory-map the file (default TRUE)
• ... Additional arguments, currently ignored
Methods
• $ReadTable(column_indices): get an arrow::Table from the file. The optional column_indices=
argument is a 0-based integer vector indicating which columns to retain.
• $ReadRowGroup(i, column_indices): get an arrow::Table by reading the ith row group (0-
based). The optional column_indices= argument is a 0-based integer vector indicating which
columns to retain.
• $ReadRowGroups(row_groups, column_indices): get an arrow::Table by reading several
row groups (0-based integers). The optional column_indices= argument is a 0-based integer
vector indicating which columns to retain.
• $GetSchema(): get the arrow::Schema of the data in the file
• $ReadColumn(i): read the ith column (0-based) as a ChunkedArray.
Active bindings
• $num_rows: number of rows.
• $num_columns: number of columns.
• $num_row_groups: number of row groups.
Examples
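A minimal sketch (the Parquet file is written just for the example):
tf <- tempfile()
write_parquet(mtcars, tf)
pq <- ParquetFileReader$create(tf)
pq$GetSchema()
pq$num_rows
col <- pq$ReadColumn(0L) # first column, 0-based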
ParquetFileWriter
Description
This class enables you to interact with Parquet files.
Factory
The ParquetFileWriter$create() factory method instantiates the object and takes the following
arguments:
• schema A Schema
• sink An arrow::io::OutputStream
• properties An instance of ParquetWriterProperties
• arrow_properties An instance of ParquetArrowWriterProperties
Methods
• WriteTable Write a Table to sink
• Close Close the writer. Note: does not close the sink. arrow::io::OutputStream has its own
close() method.
ParquetWriterProperties
ParquetWriterProperties class
Description
This class holds settings to control how a Parquet file is written by ParquetFileWriter.
Details
The parameters compression, compression_level, use_dictionary, and write_statistics support
various patterns:
• The default NULL leaves the parameter unspecified, and the C++ library uses an appropriate
default for each column (defaults listed above)
• A single, unnamed, value (e.g. a single string for compression) applies to all columns
• An unnamed vector, of the same size as the number of columns, to specify a value for each
column, in positional order
• A named vector, to specify the value for the named columns, the default value for the setting
is used when not supplied
Unlike the high-level write_parquet, ParquetWriterProperties arguments use the C++ defaults.
Currently this means "uncompressed" rather than "snappy" for the compression argument.
Factory
The ParquetWriterProperties$create() factory method instantiates the object and takes the
following arguments:
See Also
write_parquet
Schema for information about schemas and metadata handling.
Partitioning
Description
A Partitioning describes how a Dataset's file paths map to field values. DirectoryPartitioning
matches raw path segments to fields in order, while HivePartitioning parses "key=value" segments
such as "/year=2019/month=2/data.parquet".
Factory
Both DirectoryPartitioning$create() and HivePartitioning$create() methods take a Schema
as a single input argument. The helper function hive_partition(...) is shorthand for
HivePartitioning$create(schema(...)).
With DirectoryPartitioningFactory$create(), you can provide just the names of the path
segments (for example, c("year", "month")), and the DatasetFactory will infer the data types
for those partition variables. HivePartitioningFactory$create() takes no arguments: both
variable names and their types can be inferred from the file paths. hive_partition() with no
arguments returns a HivePartitioningFactory.
read_arrow
Description
Apache Arrow defines two formats for serializing data for interprocess communication (IPC): a
"stream" format and a "file" format, known as Feather. read_ipc_stream() and read_feather()
read those formats, respectively.
Usage
read_arrow(file, ...)
Arguments
file A character file name or URI, raw vector, an Arrow input stream, or a FileSystem
with path (SubTreeFileSystem). If a file name or URI, an Arrow InputStream
will be opened and closed when finished. If an input stream is provided, it will
be left open.
... extra parameters passed to read_feather().
as_data_frame Should the function return a data.frame (default) or an Arrow Table?
Details
read_arrow(), a wrapper around read_ipc_stream() and read_feather(), is deprecated. You
should explicitly choose the function that will read the desired IPC format (stream or file) since a
file or InputStream may contain either.
Value
A data.frame if as_data_frame is TRUE (the default), or an Arrow Table otherwise
See Also
read_feather() for reading IPC files. RecordBatchReader for a lower-level interface.
read_delim_arrow
Description
These functions use the Arrow C++ CSV reader to read into a data.frame. Arrow C++ options
have been mapped to argument names that follow those of readr::read_delim(), and col_select
was inspired by vroom::vroom().
Usage
read_delim_arrow(
file,
delim = ",",
quote = "\"",
escape_double = TRUE,
escape_backslash = FALSE,
schema = NULL,
col_names = TRUE,
col_types = NULL,
col_select = NULL,
na = c("", "NA"),
quoted_na = TRUE,
skip_empty_rows = TRUE,
skip = 0L,
parse_options = NULL,
convert_options = NULL,
read_options = NULL,
as_data_frame = TRUE,
timestamp_parsers = NULL
)
read_csv_arrow(
file,
quote = "\"",
escape_double = TRUE,
escape_backslash = FALSE,
schema = NULL,
col_names = TRUE,
col_types = NULL,
col_select = NULL,
na = c("", "NA"),
quoted_na = TRUE,
skip_empty_rows = TRUE,
skip = 0L,
parse_options = NULL,
convert_options = NULL,
read_options = NULL,
as_data_frame = TRUE,
timestamp_parsers = NULL
)
read_tsv_arrow(
file,
quote = "\"",
escape_double = TRUE,
escape_backslash = FALSE,
schema = NULL,
col_names = TRUE,
col_types = NULL,
col_select = NULL,
na = c("", "NA"),
quoted_na = TRUE,
skip_empty_rows = TRUE,
skip = 0L,
parse_options = NULL,
convert_options = NULL,
read_options = NULL,
as_data_frame = TRUE,
timestamp_parsers = NULL
)
Arguments
file A character file name or URI, raw vector, an Arrow input stream, or a FileSystem
with path (SubTreeFileSystem). If a file name, a memory-mapped Arrow In-
putStream will be opened and closed when finished; compression will be de-
tected from the file extension and handled automatically. If an input stream is
provided, it will be left open.
delim Single character used to separate fields within a record.
quote Single character used to quote strings.
escape_double Does the file escape quotes by doubling them? i.e. If this option is TRUE, the
value """" represents a single quote, \".
escape_backslash
Does the file use backslashes to escape special characters? This is more gen-
eral than escape_double as backslashes can be used to escape the delimiter
character, the quote character, or to add special characters like \\n.
schema Schema that describes the table. If provided, it will be used to satisfy both
col_names and col_types.
col_names If TRUE, the first row of the input will be used as the column names and will
not be included in the data frame. If FALSE, column names will be generated
by Arrow, starting with "f0", "f1", ..., "fN". Alternatively, you can specify a
character vector of column names.
col_types A compact string representation of the column types, or NULL (the default) to
infer types from the data.
col_select A character vector of column names to keep, as in the "select" argument to
data.table::fread(), or a tidy selection specification of columns, as used in
dplyr::select().
na A character vector of strings to interpret as missing values.
quoted_na Should missing values inside quotes be treated as missing values (the default)
or as strings? (Note that this is different from the Arrow C++ default for the
corresponding convert option, strings_can_be_null.)
skip_empty_rows
Should blank rows be ignored altogether? If TRUE, blank rows will not be repre-
sented at all. If FALSE, they will be filled with missings.
skip Number of lines to skip before reading data.
parse_options see file reader options. If given, this overrides any parsing options provided in
other arguments (e.g. delim, quote, etc.).
convert_options
see file reader options
read_options see file reader options
as_data_frame Should the function return a data.frame (default) or an Arrow Table?
timestamp_parsers
User-defined timestamp parsers. If more than one parser is specified, the CSV
conversion logic will try parsing values starting from the beginning of this vec-
tor. Possible values are:
• NULL: the default, which uses the ISO-8601 parser
• a character vector of strptime parse strings
• a list of TimestampParser objects
Details
read_csv_arrow() and read_tsv_arrow() are wrappers around read_delim_arrow() that specify a
delimiter ("," and "\t", respectively).
Value
A data.frame, or a Table if as_data_frame = FALSE.
Examples
tf <- tempfile()
on.exit(unlink(tf))
write.csv(mtcars, file = tf)
df <- read_csv_arrow(tf)
dim(df)
# Can select columns
df <- read_csv_arrow(tf, col_select = starts_with("d"))
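To get an Arrow Table instead of a data.frame, set as_data_frame = FALSE (reusing the file written above):
tab <- read_csv_arrow(tf, as_data_frame = FALSE)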
Description
Feather provides binary columnar serialization for data frames. It is designed to make reading and
writing data frames efficient, and to make sharing data across data analysis languages easy. This
function reads both the original, limited specification of the format and the version 2 specification,
which is the Apache Arrow IPC file format.
Usage
read_feather(file, col_select = NULL, as_data_frame = TRUE, ...)
Arguments
file A character file name or URI, raw vector, an Arrow input stream, or a FileSystem
with path (SubTreeFileSystem). If a file name or URI, an Arrow InputStream
will be opened and closed when finished. If an input stream is provided, it will
be left open.
col_select A character vector of column names to keep, as in the "select" argument to
data.table::fread(), or a tidy selection specification of columns, as used in
dplyr::select().
as_data_frame Should the function return a data.frame (default) or an Arrow Table?
... additional parameters, passed to make_readable_file().
Value
A data.frame if as_data_frame is TRUE (the default), or an Arrow Table otherwise
See Also
FeatherReader and RecordBatchReader for lower-level access to reading Arrow IPC data.
Examples
tf <- tempfile()
on.exit(unlink(tf))
write_feather(mtcars, tf)
df <- read_feather(tf)
dim(df)
# Can select columns
df <- read_feather(tf, col_select = starts_with("d"))
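# Illustrative addition: return an Arrow Table instead of a data.frame
tab <- read_feather(tf, as_data_frame = FALSE)
tab$num_rows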
Description
Using JsonTableReader
Usage
read_json_arrow(
file,
col_select = NULL,
as_data_frame = TRUE,
schema = NULL,
...
)
Arguments
file A character file name or URI, raw vector, an Arrow input stream, or a FileSystem
with path (SubTreeFileSystem). If a file name, a memory-mapped Arrow In-
putStream will be opened and closed when finished; compression will be de-
tected from the file extension and handled automatically. If an input stream is
provided, it will be left open.
col_select A character vector of column names to keep, as in the "select" argument to
data.table::fread(), or a tidy selection specification of columns, as used in
dplyr::select().
as_data_frame Should the function return a data.frame (default) or an Arrow Table?
schema Schema that describes the table.
... Additional options passed to JsonTableReader$create()
Value
A data.frame, or a Table if as_data_frame = FALSE.
Examples
tf <- tempfile()
on.exit(unlink(tf))
writeLines('
{ "hello": 3.5, "world": false, "yo": "thing" }
{ "hello": 3.25, "world": null }
{ "hello": 0.0, "world": true, "yo": null }
', tf, useBytes = TRUE)
df <- read_json_arrow(tf)
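# Illustrative addition: columns can also be selected on read
read_json_arrow(tf, col_select = c("hello", "world"))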
Description
Read a Message from a stream
Usage
read_message(stream)
Arguments
stream an InputStream
Description
’Parquet’ is a columnar storage file format. This function enables you to read Parquet files into R.
Usage
read_parquet(
file,
col_select = NULL,
as_data_frame = TRUE,
props = ParquetArrowReaderProperties$create(),
...
)
Arguments
file A character file name or URI, raw vector, an Arrow input stream, or a FileSystem
with path (SubTreeFileSystem). If a file name or URI, an Arrow InputStream
will be opened and closed when finished. If an input stream is provided, it will
be left open.
col_select A character vector of column names to keep, as in the "select" argument to
data.table::fread(), or a tidy selection specification of columns, as used in
dplyr::select().
as_data_frame Should the function return a data.frame (default) or an Arrow Table?
props ParquetArrowReaderProperties
... Additional arguments passed to ParquetFileReader$create()
Value
An arrow::Table, or a data.frame if as_data_frame is TRUE (the default).
Examples
tf <- tempfile()
on.exit(unlink(tf))
write_parquet(mtcars, tf)
df <- read_parquet(tf, col_select = starts_with("d"))
head(df)
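# Illustrative addition: return an Arrow Table instead of a data.frame
tab <- read_parquet(tf, as_data_frame = FALSE)
tab$num_rows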
Description
read a Schema from a stream
Usage
read_schema(stream, ...)
Arguments
stream a Message, InputStream, or Buffer
... currently ignored
Value
A Schema
Description
A record batch is a collection of equal-length arrays matching a particular Schema. It is a table-like
data structure that is semantically a sequence of fields, each a contiguous Arrow Array.
Usage
record_batch(..., schema = NULL)
Arguments
... A data.frame or a named set of Arrays or vectors. If given a mixture of
data.frames and vectors, the inputs will be autospliced together (see examples).
Alternatively, you can provide a single Arrow IPC InputStream, Message,
Buffer, or R raw object containing a Buffer.
schema a Schema, or NULL (the default) to infer the schema from the data in .... When
providing an Arrow IPC buffer, schema is required.
R6 Methods
In addition to the more R-friendly S3 methods, a RecordBatch object has the following R6 methods
that map onto the underlying C++ methods:
• $Equals(other): Returns TRUE if the other record batch is equal
• $column(i): Extract an Array by integer position from the batch
• $column_name(i): Get a column’s name by integer position
• $names(): Get all column names (called by names(batch))
• $RenameColumns(value): Set all column names (called by names(batch) <- value)
• $GetColumnByName(name): Extract an Array by string name
• $RemoveColumn(i): Drops a column from the batch by integer position
• $SelectColumns(indices): Return a new record batch with a selection of columns, expressed
as 0-based integers.
• $Slice(offset, length = NULL): Create a zero-copy view starting at the indicated integer offset
and going for the given length, or to the end of the table if NULL, the default.
• $Take(i): return a RecordBatch with rows at positions given by integers (R vector or Arrow
Array) i.
• $Filter(i, keep_na = TRUE): return a RecordBatch with rows at positions where the logical
vector (or Arrow boolean Array) i is TRUE.
• $SortIndices(names, descending = FALSE): return an Array of integer row positions that can
be used to rearrange the RecordBatch in ascending or descending order by the first named
column, breaking ties with further named columns. descending can be a logical vector of
length one or of the same length as names.
• $serialize(): Returns a raw vector suitable for interprocess communication
• $cast(target_schema, safe = TRUE, options = cast_options(safe)): Alter the schema of the
record batch.
There are also some active bindings
• $num_columns
• $num_rows
• $schema
• $metadata: Returns the key-value metadata of the Schema as a named list. Modify or replace
by assigning in (batch$metadata <- new_metadata). All list elements are coerced to string.
See schema() for more information.
• $columns: Returns a list of Arrays
Examples
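# Illustrative sketch; the original manual's example code is not reproduced here.
batch <- record_batch(name = c("a", "b", "c"), score = c(1.5, 2.5, 3.5))
batch$num_rows
batch$schema
# data.frames and vectors are autospliced together
record_batch(data.frame(x = 1:3), y = letters[1:3])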
Description
Apache Arrow defines two formats for serializing data for interprocess communication (IPC): a
"stream" format and a "file" format, known as Feather. RecordBatchStreamReader and RecordBatchFileReader
are interfaces for accessing record batches from input sources in those formats, respectively.
For guidance on how to use these classes, see the examples section.
Factory
The RecordBatchFileReader$create() and RecordBatchStreamReader$create() factory meth-
ods instantiate the object and take a single argument, named according to the class:
• file A character file name, raw vector, or Arrow file connection object (e.g. RandomAccess-
File).
• stream A raw vector, Buffer, or InputStream.
Methods
• $read_next_batch(): Returns a RecordBatch, iterating through the Reader. If there are no
further batches in the Reader, it returns NULL.
• $schema: Returns a Schema (active binding)
• $batches(): Returns a list of RecordBatches
• $read_table(): Collects the reader’s RecordBatches into a Table
• $get_batch(i): For RecordBatchFileReader, return a particular batch by an integer index.
• $num_record_batches(): For RecordBatchFileReader, see how many batches are in the file.
See Also
read_ipc_stream() and read_feather() provide a much simpler interface for reading data from
these formats and are sufficient for many use cases.
Examples
tf <- tempfile()
on.exit(unlink(tf))
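# (Illustrative sketch, not part of the original excerpt: the file must first be
# written with a RecordBatchFileWriter before it can be read back)
batch <- record_batch(chickwts)
file_obj <- FileOutputStream$create(tf)
writer <- RecordBatchFileWriter$create(file_obj, batch$schema)
writer$write(batch)
writer$close()
file_obj$close()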
# Now, we have a file we can read from. Same pattern: open file connection,
# then pass it to a RecordBatchReader
read_file_obj <- ReadableFile$create(tf)
reader <- RecordBatchFileReader$create(read_file_obj)
# RecordBatchFileReader knows how many batches it has (StreamReader does not)
reader$num_record_batches
# We could consume the Reader by calling $read_next_batch() until all are
# consumed, or we can call $read_table() to pull them all into a Table
tab <- reader$read_table()
# Call as.data.frame to turn that Table into an R data.frame
df <- as.data.frame(tab)
# This should be the same data we sent
all.equal(df, chickwts, check.attributes = FALSE)
# Unlike the Writers, we don't have to close RecordBatchReaders,
# but we do still need to close the file connection
read_file_obj$close()
Description
Apache Arrow defines two formats for serializing data for interprocess communication (IPC): a
"stream" format and a "file" format, known as Feather. RecordBatchStreamWriter and RecordBatchFileWriter
are interfaces for writing record batches to those formats, respectively.
For guidance on how to use these classes, see the examples section.
Factory
The RecordBatchFileWriter$create() and RecordBatchStreamWriter$create() factory meth-
ods instantiate the object and take the following arguments:
• sink An OutputStream
• schema A Schema for the data to be written
• use_legacy_format logical: write data formatted so that Arrow libraries versions 0.14 and
lower can read it. Default is FALSE. You can also enable this by setting the environment
variable ARROW_PRE_0_15_IPC_FORMAT=1.
• metadata_version: A string like "V5" or the equivalent integer indicating the Arrow IPC
MetadataVersion. Default (NULL) will use the latest version, unless the environment variable
ARROW_PRE_1_0_METADATA_VERSION=1, in which case it will be V4.
Methods
• $write(x): Write a RecordBatch, Table, or data.frame, dispatching to the methods below
appropriately
• $write_batch(batch): Write a RecordBatch to stream
• $write_table(table): Write a Table to stream
• $close(): close stream. Note that this indicates end-of-file or end-of-stream; it does not close
the connection to the sink. That needs to be closed separately.
See Also
write_ipc_stream() and write_feather() provide a much simpler interface for writing data to
these formats and are sufficient for many use cases. write_to_raw() is a version that serializes
data to a buffer.
Examples
tf <- tempfile()
on.exit(unlink(tf))
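# (Illustrative sketch, not part of the original excerpt: write a batch to the
# file so there is something to read back below)
batch <- record_batch(chickwts)
file_obj <- FileOutputStream$create(tf)
writer <- RecordBatchFileWriter$create(file_obj, batch$schema)
writer$write(batch)
writer$close()
file_obj$close()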
# Now, we have a file we can read from. Same pattern: open file connection,
# then pass it to a RecordBatchReader
read_file_obj <- ReadableFile$create(tf)
reader <- RecordBatchFileReader$create(read_file_obj)
# RecordBatchFileReader knows how many batches it has (StreamReader does not)
reader$num_record_batches
# We could consume the Reader by calling $read_next_batch() until all are
# consumed, or we can call $read_table() to pull them all into a Table
tab <- reader$read_table()
# Call as.data.frame to turn that Table into an R data.frame
df <- as.data.frame(tab)
# This should be the same data we sent
all.equal(df, chickwts, check.attributes = FALSE)
# Unlike the Writers, we don't have to close RecordBatchReaders,
# but we do still need to close the file connection
read_file_obj$close()
Description
s3_bucket() is a convenience function to create an S3FileSystem object that automatically detects
the bucket's AWS region and holds onto its relative path.
Usage
s3_bucket(bucket, ...)
Arguments
bucket string S3 bucket name or path
... Additional connection options, passed to S3FileSystem$create()
Value
A SubTreeFileSystem containing an S3FileSystem and the bucket’s relative path. Note that this
function’s success does not guarantee that you are authorized to access the bucket’s contents.
Examples
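# Illustrative sketch (requires network access and, for private buckets, AWS
# credentials); the bucket name here is only an example
bucket <- s3_bucket("ursa-labs-taxi-data")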
Description
A Scalar holds a single value of an Arrow type.
Methods
• $ToString(): convert to a string
• $as_vector(): convert to an R vector
• $as_array(): convert to an Arrow Array
• $Equals(other): is this Scalar equal to other
• $ApproxEquals(other): is this Scalar approximately equal to other
• $is_valid: is this Scalar valid
• $null_count: number of invalid values (1 or 0)
• $type: Scalar type
Examples
Scalar$create(pi)
Scalar$create(404)
# If you pass a vector into Scalar$create, you get a list containing your items
Scalar$create(c(1, 2, 3))
# Comparisons
my_scalar <- Scalar$create(99)
my_scalar$ApproxEquals(Scalar$create(99.00001)) # FALSE
my_scalar$ApproxEquals(Scalar$create(99.000009)) # TRUE
my_scalar$Equals(Scalar$create(99.000009)) # FALSE
my_scalar$Equals(Scalar$create(99L)) # FALSE (types don't match)
my_scalar$ToString()
Description
A Scanner iterates over a Dataset’s fragments and returns data according to given row filtering and
column projection. A ScannerBuilder can help create one.
Factory
Scanner$create() wraps the ScannerBuilder interface to make a Scanner. It takes the following
arguments:
• projection: A character vector of column names to select columns or a named list of expres-
sions
• filter: A Expression to filter the scanned rows by, or TRUE (default) to keep all rows.
• use_threads: logical: should scanning use multithreading? Default TRUE
• use_async: logical: should the async scanner (performs better on high-latency/highly parallel
filesystems like S3) be used? Default FALSE
• ...: Additional arguments, currently ignored
Methods
ScannerBuilder has the following methods:
• $Project(cols): Indicate that the scan should only return columns given by cols, a character
vector of column names
• $Filter(expr): Filter rows by an Expression.
• $UseThreads(threads): logical: should the scan use multithreading? The method’s default
input is TRUE, but you must call the method to enable multithreading because the scanner
default is FALSE.
• $UseAsync(use_async): logical: should the async scanner be used?
• $BatchSize(batch_size): integer: Maximum row count of scanned record batches, default is
32K. If scanned record batches are overflowing memory then this method can be called to
reduce their size.
• $schema: Active binding, returns the Schema of the Dataset
• $Finish(): Returns a Scanner
Scanner currently has a single method, $ToTable(), which evaluates the query and returns an Arrow
Table.
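Examples
# Illustrative sketch (not from the original entry): build a Scanner over an
# in-memory Dataset and collect the result into a Table
ds <- InMemoryDataset$create(mtcars)
scanner <- Scanner$create(ds, projection = c("mpg", "cyl", "gear"))
tab <- scanner$ToTable()
as.data.frame(tab)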
Description
A Schema is a list of Fields, which map names to Arrow data types. Create a Schema when you want
to convert an R data.frame to Arrow but don’t want to rely on the default mapping of R types to
Arrow types, such as when you want to choose a specific numeric precision, or when creating a
Dataset and you want to ensure a specific schema rather than inferring it from the various files.
Many Arrow objects, including Table and Dataset, have a $schema method (active binding) that lets
you access their schema.
Usage
schema(...)
Arguments
... named list containing data types or a list of fields containing the fields for the
schema
Methods
• $ToString(): convert to a string
• $field(i): returns the field at index i (0-based)
• $GetFieldByName(x): returns the field with name x
• $WithMetadata(metadata): returns a new Schema with the key-value metadata set. Note that
all list elements in metadata will be coerced to character.
Active bindings
• $names: returns the field names (called in names(Schema))
• $num_fields: returns the number of fields (called in length(Schema))
• $fields: returns the list of Fields in the Schema, suitable for iterating over
• $HasMetadata: logical: does this Schema have extra metadata?
• $metadata: returns the key-value metadata as a named list. Modify or replace by assigning in
(sch$metadata <- new_metadata). All list elements are coerced to string.
R Metadata
When converting a data.frame to an Arrow Table or RecordBatch, attributes from the data.frame
are saved alongside tables so that the object can be reconstructed faithfully in R (e.g. with as.data.frame()).
This metadata can exist at the top level of the data.frame (e.g. attributes(df)), at the column
level (e.g. attributes(df$col_a)), or, for list columns only, at the element level (e.g. attributes(df[1, "col_a"])).
For example, this allows for storing haven columns in a table and being able to faithfully re-create
them when pulled back into R. This metadata is separate from the schema (column names and types)
which is compatible with other Arrow clients. The R metadata is only read by R and is ignored by
other clients (e.g. Pandas has its own custom metadata). This metadata is stored in $metadata$r.
Since Schema metadata keys and values must be strings, this metadata is saved by serializing R’s
attribute list structure to a string. If the serialized metadata exceeds 100Kb in size, by default it
is compressed starting in version 3.0.0. To disable this compression (e.g. for tables that are com-
patible with Arrow versions before 3.0.0 and include large amounts of metadata), set the option
arrow.compress_metadata to FALSE. Files with compressed metadata are readable by older ver-
sions of arrow, but the metadata is dropped.
Examples
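# Illustrative sketch; the original manual's example code is not reproduced here.
schm <- schema(a = int32(), b = float64(), c = utf8())
schm
schm$names
# Attach a schema when creating a Table to control the Arrow types used
arrow_table(data.frame(a = 1:3, b = c(1.5, 2.5, 3.5), c = c("x", "y", "z")), schema = schm)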
Description
A Table is a sequence of chunked arrays. They have a similar interface to record batches, but they
can be composed from multiple record batches or chunked arrays.
Usage
arrow_table(..., schema = NULL)
Arguments
... A data.frame or a named set of Arrays or vectors. If given a mixture of
data.frames and named vectors, the inputs will be autospliced together (see
examples). Alternatively, you can provide a single Arrow IPC InputStream,
Message, Buffer, or R raw object containing a Buffer.
schema a Schema, or NULL (the default) to infer the schema from the data in .... When
providing an Arrow IPC buffer, schema is required.
R6 Methods
In addition to the more R-friendly S3 methods, a Table object has the following R6 methods that
map onto the underlying C++ methods:
• $column(i): Extract a ChunkedArray by integer position from the table
• $ColumnNames(): Get all column names (called by names(tab))
• $RenameColumns(value): Set all column names (called by names(tab) <- value)
• $GetColumnByName(name): Extract a ChunkedArray by string name
• $field(i): Extract a Field from the table schema by integer position
• $SelectColumns(indices): Return new Table with specified columns, expressed as 0-based
integers.
• $Slice(offset, length = NULL): Create a zero-copy view starting at the indicated integer offset
and going for the given length, or to the end of the table if NULL, the default.
• $Take(i): return a Table with rows at positions given by integers i. If i is an Arrow Array
or ChunkedArray, it will be coerced to an R vector before taking.
• $Filter(i, keep_na = TRUE): return a Table with rows at positions where the logical vector or
Arrow boolean-type (Chunked)Array i is TRUE.
• $SortIndices(names, descending = FALSE): return an Array of integer row positions that can
be used to rearrange the Table in ascending or descending order by the first named column,
breaking ties with further named columns. descending can be a logical vector of length one
or of the same length as names.
• $serialize(output_stream, ...): Write the table to the given OutputStream
• $cast(target_schema, safe = TRUE, options = cast_options(safe)): Alter the schema of the
record batch.
Examples
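# Illustrative sketch; the original manual's example code is not reproduced here.
tab <- arrow_table(x = 1:5, y = letters[1:5])
tab$num_rows
tab$schema
# Convert back to a data.frame
as.data.frame(tab)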
Description
This can be used in pipelines that pass data back and forth between Arrow and other processes (like
DuckDB).
Usage
to_arrow(.data)
Arguments
.data the object to be converted
Value
an arrow_dplyr_query object, to be used in dplyr pipelines.
Examples
library(dplyr)
ds <- InMemoryDataset$create(mtcars)
ds %>%
filter(mpg < 30) %>%
to_duckdb() %>%
group_by(cyl) %>%
summarize(mean_mpg = mean(mpg, na.rm = TRUE)) %>%
to_arrow() %>%
collect()
Description
This will do the necessary configuration to create a (virtual) table in DuckDB that is backed by the
Arrow object given. No data is copied or modified until collect() or compute() are called or a
query is run against the table.
Usage
to_duckdb(
.data,
con = arrow_duck_connection(),
table_name = unique_arrow_tablename(),
auto_disconnect = FALSE
)
Arguments
.data the Arrow object (e.g. Dataset, Table) to use for the DuckDB table
con a DuckDB connection to use (default will create one and store it in options("arrow_duck_con"))
table_name a name to use in DuckDB for this object. The default is a unique string "arrow_"
followed by numbers.
auto_disconnect
should the table be automatically cleaned up when the resulting object is re-
moved (and garbage collected)? Default: FALSE
Details
Value
Examples
library(dplyr)
ds <- InMemoryDataset$create(mtcars)
ds %>%
filter(mpg < 30) %>%
to_duckdb() %>%
group_by(cyl) %>%
summarize(mean_mpg = mean(mpg, na.rm = TRUE))
Description
Infer the Arrow Array type from an R vector.
Usage
type(x)
Arguments
x an R vector
Value
Examples
type(1:10)
type(1L:10L)
type(c(1, 1.5, 2))
type(c("A", "B", "C"))
type(mtcars)
type(Sys.Date())
Description
Usage
Arguments
Value
A Schema with the union of fields contained in the inputs, or NULL if any of schemas is NULL
Examples
Description
This function tabulates the values in the array and returns a table of counts.
Usage
value_counts(x)
Arguments
x Array or ChunkedArray
Value
A StructArray containing "values" (same type as x) and "counts" Int64.
Examples
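# Illustrative sketch; the original manual's example code is not reproduced here.
value_counts(Array$create(c("apple", "banana", "apple", "cherry", "apple")))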
Description
Apache Arrow defines two formats for serializing data for interprocess communication (IPC): a
"stream" format and a "file" format, known as Feather. write_ipc_stream() and write_feather()
write those formats, respectively.
Usage
write_arrow(x, sink, ...)
Arguments
x data.frame, RecordBatch, or Table
sink A string file path, URI, or OutputStream, or path in a file system (SubTreeFileSystem)
... extra parameters passed to write_feather().
Details
write_arrow(), a wrapper around write_ipc_stream() and write_feather() with some non-
standard behavior, is deprecated. You should explicitly choose the function that will write the
desired IPC format (stream or file) since either can be written to a file or OutputStream.
Value
x, invisibly.
See Also
write_feather() for writing IPC files. write_to_raw() to serialize data to a buffer. Record-
BatchWriter for a lower-level interface.
Examples
tf <- tempfile()
on.exit(unlink(tf))
write_ipc_stream(mtcars, tf)
Description
Write CSV file to disk
Usage
write_csv_arrow(x, sink, include_header = TRUE, batch_size = 1024L)
Arguments
x data.frame, RecordBatch, or Table
sink A string file path, URI, or OutputStream, or path in a file system (SubTreeFileSystem)
include_header Whether to write an initial header line with column names
batch_size Maximum number of rows processed at a time. Default is 1024.
Value
The input x, invisibly. Note that if sink is an OutputStream, the stream will be left open.
Examples
tf <- tempfile()
on.exit(unlink(tf))
write_csv_arrow(mtcars, tf)
Description
This function allows you to write a dataset. By writing to more efficient binary storage formats, and
by specifying relevant partitioning, you can make it much faster to read and query.
Usage
write_dataset(
dataset,
path,
format = c("parquet", "feather", "arrow", "ipc", "csv"),
partitioning = dplyr::group_vars(dataset),
basename_template = paste0("part-{i}.", as.character(format)),
hive_style = TRUE,
existing_data_behavior = c("overwrite", "error", "delete_matching"),
...
)
Arguments
dataset Dataset, RecordBatch, Table, arrow_dplyr_query, or data.frame. If an arrow_dplyr_query,
the query will be evaluated and the result will be written. This means that you
can select(), filter(), mutate(), etc. to transform the data before it is writ-
ten if you need to.
path string path, URI, or SubTreeFileSystem referencing a directory to write to
(directory will be created if it does not exist)
format a string identifier of the file format. Default is to use "parquet" (see FileFormat)
partitioning Partitioning or a character vector of columns to use as partition keys (to be
written as path segments). Default is to use the current group_by() columns.
basename_template
string template for the names of files to be written. Must contain "{i}", which
will be replaced with an autoincremented integer to generate basenames of datafiles.
For example, "part-{i}.feather" will yield "part-0.feather", ....
hive_style logical: write partition segments as Hive-style (key1=value1/key2=value2/file.ext)
or as just bare values. Default is TRUE.
existing_data_behavior
The behavior to use when there is already data in the destination directory. Must
be one of "overwrite", "error", or "delete_matching".
• "overwrite" (the default) then any new files created will overwrite existing
files
• "error" then the operation will fail if the destination directory is not empty
• "delete_matching" then the writer will delete any existing partitions if data
is going to be written to those partitions and will leave alone partitions
which data is not written to.
... additional format-specific arguments. For available Parquet options, see write_parquet().
The available Feather options are
• use_legacy_format logical: write data formatted so that Arrow libraries
versions 0.14 and lower can read it. Default is FALSE. You can also enable
this by setting the environment variable ARROW_PRE_0_15_IPC_FORMAT=1.
• metadata_version: A string like "V5" or the equivalent integer indicating
the Arrow IPC MetadataVersion. Default (NULL) will use the latest ver-
sion, unless the environment variable ARROW_PRE_1_0_METADATA_VERSION=1,
in which case it will be V4.
• codec: A Codec which will be used to compress body buffers of written
files. Default (NULL) will not compress body buffers.
• null_fallback: character to be used in place of missing values (NA or
NULL) when using Hive-style partitioning. See hive_partition().
Value
The input dataset, invisibly
Examples
# You can write datasets partitioned by the values in a column (here: "cyl").
# This creates a structure of the form cyl=X/part-Z.parquet.
one_level_tree <- tempfile()
write_dataset(mtcars, one_level_tree, partitioning = "cyl")
list.files(one_level_tree, recursive = TRUE)
# You can obtain the same result as the previous examples using arrow with
# a dplyr pipeline. This will be the same as two_levels_tree above, but the
# output directory will be different.
library(dplyr)
two_levels_tree_2 <- tempfile()
mtcars %>%
group_by(cyl, gear) %>%
write_dataset(two_levels_tree_2)
list.files(two_levels_tree_2, recursive = TRUE)
# And you can also turn off the Hive-style directory naming where the column
# name is included with the values by using `hive_style = FALSE`.
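# (Illustrative sketch completing the comment above; not the original code)
plain_tree <- tempfile()
write_dataset(mtcars, plain_tree, partitioning = "cyl", hive_style = FALSE)
list.files(plain_tree, recursive = TRUE)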
Description
Feather provides binary columnar serialization for data frames. It is designed to make reading and
writing data frames efficient, and to make sharing data across data analysis languages easy. This
function writes both the original, limited specification of the format and the version 2 specification,
which is the Apache Arrow IPC file format.
Usage
write_feather(
x,
sink,
version = 2,
chunk_size = 65536L,
compression = c("default", "lz4", "uncompressed", "zstd"),
compression_level = NULL
)
Arguments
x data.frame, RecordBatch, or Table
sink A string file path, URI, or OutputStream, or path in a file system (SubTreeFileSystem)
version integer Feather file version. Version 2 is the current. Version 1 is the more
limited legacy format.
chunk_size For V2 files, the number of rows that each chunk of data should have in the file.
Use a smaller chunk_size when you need faster random row access. Default is
64K. This option is not supported for V1.
compression Name of compression codec to use, if any. Default is "lz4" if LZ4 is available
in your build of the Arrow C++ library, otherwise "uncompressed". "zstd" is the
other available codec and generally has better compression ratios in exchange
for slower read and write performance. See codec_is_available(). This op-
tion is not supported for V1.
compression_level
If compression is "zstd", you may specify an integer compression level. If
omitted, the compression codec’s default compression level is used.
Value
The input x, invisibly. Note that if sink is an OutputStream, the stream will be left open.
See Also
RecordBatchWriter for lower-level access to writing Arrow IPC data.
Schema for information about schemas and metadata handling.
Examples
tf <- tempfile()
on.exit(unlink(tf))
write_feather(mtcars, tf)
Description
Parquet is a columnar storage file format. This function enables you to write Parquet files from R.
Usage
write_parquet(
x,
sink,
chunk_size = NULL,
version = NULL,
compression = default_parquet_compression(),
compression_level = NULL,
use_dictionary = NULL,
write_statistics = NULL,
data_page_size = NULL,
use_deprecated_int96_timestamps = FALSE,
coerce_timestamps = NULL,
allow_truncated_timestamps = FALSE,
properties = NULL,
arrow_properties = NULL
)
Arguments
x data.frame, RecordBatch, or Table
sink A string file path, URI, or OutputStream, or path in a file system (SubTreeFileSystem)
chunk_size chunk size in number of rows. If NULL, the total number of rows is used.
version parquet version, "1.0" or "2.0". Default "1.0". Numeric values are coerced to
character.
compression compression algorithm. Default "snappy". See details.
compression_level
compression level. Meaning depends on compression algorithm
use_dictionary Specify if we should use dictionary encoding. Default TRUE
write_statistics
Specify if we should write statistics. Default TRUE
data_page_size Set a target threshold for the approximate encoded size of data pages within a
column chunk (in bytes). Default 1 MiB.
use_deprecated_int96_timestamps
Write timestamps to INT96 Parquet format. Default FALSE.
coerce_timestamps
Cast timestamps to a particular resolution. Can be NULL, "ms" or "us". Default
NULL (no casting)
allow_truncated_timestamps
Allow loss of data when coercing timestamps to a particular resolution. E.g. if
microsecond or nanosecond data is lost when coercing to "ms", do not raise an
exception
properties A ParquetWriterProperties object, used instead of the options enumerated in
this function’s signature. Providing properties as an argument is deprecated; if
you need to assemble ParquetWriterProperties outside of write_parquet(),
use ParquetFileWriter instead.
arrow_properties
A ParquetArrowWriterProperties object. Like properties, this argument
is deprecated.
Details
Due to features of the format, Parquet files cannot be appended to. If you want to use the Parquet for-
mat but also want the ability to extend your dataset, you can write to additional Parquet files and then
treat the whole directory of files as a Dataset you can query. See vignette("dataset", package =
"arrow") for examples of this.
The parameters compression, compression_level, use_dictionary and write_statistics accept
either a single value or per-column settings, specified in one of the following ways:
• The default NULL leaves the parameter unspecified, and the C++ library uses an appropriate
default for each column (defaults listed above)
• A single, unnamed, value (e.g. a single string for compression) applies to all columns
• An unnamed vector, of the same size as the number of columns, to specify a value for each
column, in positional order
• A named vector, to specify the value for the named columns, the default value for the setting
is used when not supplied
The compression argument can be any of the following (case insensitive): "uncompressed", "snappy",
"gzip", "brotli", "zstd", "lz4", "lzo" or "bz2". Only "uncompressed" is guaranteed to be available,
but "snappy" and "gzip" are almost always included. See codec_is_available(). The default
"snappy" is used if available, otherwise "uncompressed". To disable compression, set compression
= "uncompressed". Note that "uncompressed" columns may still have dictionary encoding.
Value
the input x invisibly.
Examples
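# Basic usage (illustrative sketch; the first part of the original example is
# not shown in this excerpt)
tf1 <- tempfile(fileext = ".parquet")
write_parquet(data.frame(x = 1:5), tf1)
read_parquet(tf1)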
# using compression
if (codec_is_available("gzip")) {
tf2 <- tempfile(fileext = ".gz.parquet")
write_parquet(data.frame(x = 1:5), tf2, compression = "gzip", compression_level = 5)
}
Description
write_ipc_stream() and write_feather() write data to a sink and return the data (data.frame,
RecordBatch, or Table) they were given. This function wraps those so that you can serialize data
to a buffer and access that buffer as a raw vector in R.
Usage
write_to_raw(x, format = c("stream", "file"))
Arguments
x data.frame, RecordBatch, or Table
format one of c("stream","file"), indicating the IPC format to use
Value
A raw vector containing the bytes of the IPC serialized data.
Examples
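# Illustrative sketch; the original manual's example code is not reproduced here.
buf <- write_to_raw(mtcars)
class(buf)
# The raw vector can be read back with read_ipc_stream()
head(read_ipc_stream(buf))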