

Scikit-Learn: Machine Learning in the Python ecosystem

Gilles Louppe University of Liège, Belgium


Gaël Varoquaux Parietal, INRIA Saclay, France

Scikit-Learn

The scikit-learn¹,² project [4] is an increasingly popular machine learning library written in Python. It is designed to be simple and efficient, useful to both experts and non-experts, and reusable in a variety of contexts. The primary aim of the project is to provide a compendium of efficient implementations of classic, well-established machine learning algorithms. Among other things, it includes classical supervised and unsupervised learning algorithms, tools for model evaluation and selection, as well as tools for data preprocessing and feature engineering. scikit-learn is distributed under the 3-clause BSD license, encouraging its free use in both commercial and academic settings.

Started in 2007, scikit-learn is developed by an international team of over a dozen core developers, mostly researchers from various fields of science. The project also benefits from many occasional contributors proposing small bugfixes or improvements. Development proceeds on GitHub, which greatly facilitates this kind of collaboration. Because of the large number of developers, emphasis is put on keeping the project maintainable. Code must follow quality guidelines, such as style consistency and unit-test coverage. Documentation and examples are required for all features, and major changes must pass code review by developers not involved in the proposed change.

All algorithms within scikit-learn are offered through a simple and elegant API [1] consisting of a well-defined set of methods. This API consistency across the package makes it very usable in practice: experimenting with different learning algorithms is as simple as substituting a class definition. Through composition interfaces, the library also offers powerful mechanisms to express a wide variety of learning tasks within a small amount of easy-to-read code. Finally, through duck-typing, the consistent API leads to a library that is easily extensible, and allows user-defined estimators to be incorporated into the scikit-learn workflow without any explicit object inheritance.

Integration in the Python ecosystem

The library has been designed to tie in with standard open source tools of the scientific Python ecosystem. In particular, scikit-learn leverages NumPy [6] for efficient storage and manipulation of multi-dimensional arrays, and SciPy [3] for more specialized data structures (e.g. sparse matrices) and implementations of lower-level scientific algorithms. The scikit-learn API is designed to avoid the proliferation of framework code: it is limited and non-intrusive. This makes scikit-learn easy to use and easy to combine with other libraries. Together with IPython [5] for interactive exploration and Matplotlib [2] for dynamic data visualization, NumPy and SciPy constitute a comprehensive scientific working environment that scikit-learn smoothly complements with a host of machine learning algorithms and data analysis routines.

Demonstrations

This presentation will illustrate the use of scikit-learn as a component of the larger scientific Python environment to solve complex data analysis tasks. Examples will include end-to-end workflows based on powerful and popular algorithms in the library. Among others, we will show how to use out-of-core learning with on-the-fly feature extraction to tackle very large natural language processing tasks, how to exploit an IPython cluster for distributed cross-validation, or how to build and use random forests to explore biological data.

References

[1] L. Buitinck et al. API design for machine learning software: experiences from the scikit-learn project. In ECML/PKDD Workshop: Languages for Data Mining and Machine Learning, 2013.
[2] J. D. Hunter. Matplotlib: A 2D graphics environment. CiSE, 9:90, 2007.
[3] T. E. Oliphant. Python for scientific computing. CiSE, 9:10, 2007.
[4] F. Pedregosa et al. Scikit-learn: Machine learning in Python. JMLR, 12:2825–2830, 2011.
[5] F. Perez and B. E. Granger. IPython: a system for interactive scientific computing. CiSE, 9:21, 2007.
[6] S. van der Walt et al. The NumPy array: a structure for efficient numerical computation. CiSE, 13:22, 2011.

¹ http://scikit-learn.org
² http://mloss.org/software/view/240
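
Illustrative sketch (not part of the original abstract). The three API properties described in the Scikit-Learn section (swapping one estimator class for another, composing steps, and duck-typing) might look roughly as follows. The names `MajorityClassifier` and `holdout_accuracy` are hypothetical helpers written for this illustration; the iris dataset and the particular estimators are arbitrary choices, and a reasonably recent scikit-learn is assumed.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


class MajorityClassifier:
    """Hypothetical user-defined estimator with no scikit-learn base class.

    It only follows the fit/predict convention (duck-typing).
    """

    def fit(self, X, y):
        # Remember the most frequent label seen during training.
        values, counts = np.unique(y, return_counts=True)
        self.majority_ = values[np.argmax(counts)]
        return self

    def predict(self, X):
        # Predict the remembered label for every sample.
        return np.full(len(X), self.majority_)


def holdout_accuracy(estimator, X_train, y_train, X_test, y_test):
    """Fit any object exposing fit/predict and return its test accuracy."""
    estimator.fit(X_train, y_train)
    return float(np.mean(estimator.predict(X_test) == y_test))


X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Trying another algorithm only means constructing a different object.
candidates = {
    "tree": DecisionTreeClassifier(random_state=0),
    # Composition: scaling and classification chained into one estimator.
    "scaled_logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    # Duck-typing: a plain Python class is handled the same way.
    "majority": MajorityClassifier(),
}

scores = {
    name: holdout_accuracy(est, X_train, y_train, X_test, y_test)
    for name, est in candidates.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

The evaluation helper never inspects the type of the estimator it receives, which is the point of the duck-typed design. Library utilities that clone estimators, such as grid search, additionally expect `get_params` and `set_params`, so a bare class like the one above is only a starting point.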
