A Python library for numpy arrays that persist on disk in a format that is simple, self-documented and tool-independent, and maximizes universal readability.


Darr


Darr is a Python science library that allows you to work with potentially very large, disk-based NumPy arrays that are self-documented and that can be read in many other popular data analysis languages with minimal effort.

Universal readability of data is a pillar of good scientific practice. It is also generally a good idea for anyone who wants to move flexibly between analysis environments, save data for the longer term, or share data with others without spending much time figuring out or explaining how the receiver can read it. As you work with your Darr array, its documentation is automatically kept up to date, including a complete, human-readable description as well as code to read the array in popular languages such as R, Julia, Scilab, IDL, Matlab, Maple, and Mathematica, or in Python/NumPy without Darr (see example). A quick copy-paste of code from the array documentation is in most cases all that is needed to read your data in, for example, R or Matlab. No need to export anything, make notes, or provide elaborate explanations. No dependence on complicated formats or specialized libraries. No looking things up.

In essence, Darr makes it trivially easy to share your numerical arrays with others or with yourself when working in different computing environments, and makes them future-proof by providing documentation.

More rationale for a tool-independent approach to numeric array storage is provided here.

Under the hood, Darr uses NumPy memory-mapped arrays, a widely established and trusted way of working with disk-based numerical data, which makes Darr fully NumPy compatible. This enables efficient out-of-core read/write access to potentially very large arrays. In addition to automatic documentation, Darr adds other functionality to NumPy's memmap, such as easy appending and truncating of data, support for ragged arrays, the ability to create arrays from iterators, and easy use of metadata. Flat binary files and (JSON) text files are accompanied by a README text file that explains how the array and metadata are stored (see example arrays).
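The general idea of a flat binary file plus a JSON text description, accessed through a NumPy memory map, can be sketched with NumPy alone. This is only an illustration of the technique, not Darr's implementation or its on-disk layout: the file names `values.bin` and `description.json` and the helper functions are hypothetical.

```python
import json
import tempfile
from pathlib import Path

import numpy as np

# NOTE: file names and helpers are hypothetical, chosen for illustration;
# they are not Darr's API or its actual storage layout.

def write_flat_array(dirpath, array):
    """Store an array as raw binary plus a small JSON description."""
    dirpath = Path(dirpath)
    dirpath.mkdir(parents=True, exist_ok=True)
    array = np.ascontiguousarray(array)
    array.tofile(dirpath / "values.bin")
    description = {
        "dtype": array.dtype.str,   # e.g. "<f8" for little-endian float64
        "shape": list(array.shape),
        "order": "C",
    }
    (dirpath / "description.json").write_text(json.dumps(description, indent=2))

def open_flat_array(dirpath, mode="r+"):
    """Memory-map the binary file using the JSON description."""
    dirpath = Path(dirpath)
    description = json.loads((dirpath / "description.json").read_text())
    return np.memmap(dirpath / "values.bin", dtype=description["dtype"],
                     mode=mode, shape=tuple(description["shape"]), order="C")

def append_rows(dirpath, rows):
    """Append rows along the first axis by extending the binary file."""
    dirpath = Path(dirpath)
    descpath = dirpath / "description.json"
    description = json.loads(descpath.read_text())
    rows = np.ascontiguousarray(rows, dtype=description["dtype"])
    if list(rows.shape[1:]) != description["shape"][1:]:
        raise ValueError("rows do not match the shape of the stored array")
    with open(dirpath / "values.bin", "ab") as f:
        f.write(rows.tobytes())
    description["shape"][0] += rows.shape[0]
    descpath.write_text(json.dumps(description, indent=2))

with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "example_array"
    write_flat_array(store, np.arange(6, dtype="float64").reshape(3, 2))
    append_rows(store, [[6.0, 7.0]])
    arr = open_flat_array(store, mode="r")
    shape_after_append = arr.shape
    last_row = arr[-1].tolist()
    del arr  # release the memory map before the directory is removed
```

Because the data file contains nothing but the raw numbers, any environment that can read a binary file with a known type, shape, and byte order can read it, and appending only requires writing more bytes to the end of the file.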

See this tutorial for a brief introduction, or the documentation for more info.

Darr is currently pre-1.0, still undergoing development. It is open source and freely available under the New BSD License terms.

Features

  • Data is stored purely based on flat binary and text files, maximizing universal readability.
  • Automatic self-documentation, including copy-paste ready code snippets for reading the array in a number of popular data analysis environments, such as Python (without Darr), R, Julia, Scilab, Octave/Matlab, GDL/IDL, and Mathematica (see example array).
  • Disk-persistent array data is directly accessible through NumPy indexing and may be larger than RAM.
  • Easy and efficient appending of data (see example).
  • Supports ragged arrays.
  • Easy use of metadata, stored in a widely readable separate JSON text file (see example).
  • Many numeric types are supported: (u)int8-(u)int64, float16-float64, complex64, complex128.
  • Integrates easily with the Dask library for out-of-core computation on very large arrays.
  • Minimal dependencies, only NumPy.
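As a rough illustration of how ragged arrays (sequences of subarrays of different lengths) can be represented with plain flat arrays, the sketch below keeps all values in one one-dimensional array and records a start/stop pair per subarray. The class `RaggedSketch` is hypothetical and works in memory only; it is meant to show the technique, not Darr's ragged array API or storage format.

```python
import numpy as np

class RaggedSketch:
    """Hypothetical in-memory ragged array: one flat values array
    plus an (n, 2) array of start/stop positions."""

    def __init__(self, dtype="float64"):
        self.values = np.empty(0, dtype=dtype)
        self.indices = np.empty((0, 2), dtype="int64")

    def append(self, subarray):
        # Flatten the subarray and add it to the end of the values array.
        subarray = np.asarray(subarray, dtype=self.values.dtype).ravel()
        start = len(self.values)
        self.values = np.concatenate([self.values, subarray])
        # Record where this subarray starts and stops.
        self.indices = np.vstack([self.indices, [start, start + len(subarray)]])

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, i):
        start, stop = self.indices[i]
        return self.values[start:stop]

ra = RaggedSketch()
for sub in ([1, 2], [3, 4, 5], [6]):
    ra.append(sub)

lengths = [len(ra[i]) for i in range(len(ra))]
second = ra[1].tolist()
```

Both arrays in this representation are regular numeric arrays, so each could in principle be stored as a flat binary file and read back in any environment.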

Drawbacks

  • No compression, although compression for archiving purposes is supported.
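Since an array stored as a directory of flat files can be compressed with general-purpose tools, archiving can be sketched with the Python standard library alone. The helper `archive_directory` and the file names below are hypothetical; this is not Darr's own archiving function, only an assumption about what compressing such a directory for storage could look like.

```python
import tarfile
import tempfile
from pathlib import Path

def archive_directory(dirpath, archivepath=None):
    """Hypothetical helper: pack a directory into a gzip-compressed tar file."""
    dirpath = Path(dirpath)
    if archivepath is None:
        archivepath = dirpath.with_name(dirpath.name + ".tar.gz")
    with tarfile.open(archivepath, "w:gz") as tar:
        # arcname keeps only the directory name, not the full path.
        tar.add(dirpath, arcname=dirpath.name)
    return Path(archivepath)

with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "example_array"
    store.mkdir()
    # Highly compressible dummy data standing in for an array file.
    (store / "values.bin").write_bytes(bytes(100_000))
    raw_size = (store / "values.bin").stat().st_size

    archive = archive_directory(store)
    archive_size = archive.stat().st_size
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
```

The archive has to be unpacked again before the data can be memory-mapped, which is why this kind of compression suits archiving rather than everyday access.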

Installation

Darr officially depends on Python 3.9 or higher. Older versions may work (probably >= 3.6) but are not tested.

Install Darr from PyPI:

$ pip install darr

Or, install Darr via conda:

$ conda install -c conda-forge darr

To install the latest development version, use pip with the latest GitHub master:

$ pip install git+https://github.com/gbeckers/darr@master

Documentation

See the documentation for more information.

Contributing

Any help / suggestions / ideas / contributions are welcome and very much appreciated. For any comment, question, or error, please open an issue or propose a pull request.

Other interesting projects

If Darr is not exactly what you are looking for, have a look at these projects:

Darr is BSD licensed (BSD 3-Clause License). (c) 2017-2023, Gabriël Beckers