.. _roadmap:

=======
Roadmap
=======

This page provides an overview of the major themes in pandas' development. Each of
these items requires a relatively large amount of effort to implement. These may
be achieved more quickly with dedicated funding or interest from contributors.

An item being on the roadmap does not mean that it will *necessarily* happen, even
with unlimited funding. During the implementation period we may discover issues
preventing the adoption of the feature.

Additionally, an item *not* being on the roadmap does not exclude it from inclusion
in pandas. The roadmap is intended for larger, fundamental changes to the project that
are likely to take months or years of developer time. Smaller-scoped items will continue
to be tracked on our `issue tracker <https://github.com/pandas-dev/pandas/issues>`__.

See :ref:`roadmap.evolution` for proposing changes to this document.

Extensibility
-------------

pandas :ref:`extending.extension-types` allow for extending NumPy types with custom
data types and array storage. pandas uses extension types internally, and provides
an interface for 3rd-party libraries to define their own custom data types.

Many parts of pandas still unintentionally convert data to a NumPy array.
These problems are especially pronounced for nested data.

We'd like to improve the handling of extension arrays throughout the library,
making their behavior more consistent with the handling of NumPy arrays. We'll do this
by cleaning up pandas' internals and adding new methods to the extension array interface.
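
As a rough illustration of the unintended conversion described above (a minimal
sketch, assuming a pandas version that ships the nullable ``Int64`` extension dtype;
the variable names are illustrative only), coercing an extension array to NumPy
discards its dtype:

.. code-block:: python

    import numpy as np
    import pandas as pd

    # A nullable integer array is an extension array, not a NumPy array.
    arr = pd.array([1, 2, None], dtype="Int64")
    is_extension = isinstance(arr, pd.api.extensions.ExtensionArray)

    # Coercing to NumPy falls back to object dtype and loses the integer dtype.
    as_numpy = np.asarray(arr)
    print(arr.dtype, as_numpy.dtype)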

String data type
----------------

Currently, pandas stores text data in an ``object``-dtype NumPy array.
The current implementation has two primary drawbacks. First, ``object`` dtype
is not specific to strings: any Python object can be stored in an ``object``-dtype
array, not just strings. Second, it is not efficient: the NumPy memory model
isn't especially well-suited to variable-width text data.

To solve the first issue, we propose a new extension type for string data. This
will initially be opt-in, with users explicitly requesting ``dtype="string"``.
The array backing this string dtype may initially be the current implementation:
an ``object``-dtype NumPy array of Python strings.
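
The difference between the two dtypes can be sketched as follows (an illustrative
example, assuming a pandas version in which ``dtype="string"`` is available; the
variable names are not part of any API):

.. code-block:: python

    import pandas as pd

    # object dtype accepts any Python object, not just strings.
    mixed = pd.Series(["a", 1.5, None], dtype=object)

    # Opting in to the dedicated string dtype.
    text = pd.Series(["a", "b", None], dtype="string")
    upper = text.str.upper()
    print(mixed.dtype, text.dtype, upper.tolist())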

To solve the second issue (performance), we'll explore alternative in-memory
array libraries (for example, Apache Arrow). As part of the work, we may
need to implement certain operations expected by pandas users (for example,
the algorithm used in ``Series.str.upper``). That work may be done outside of
pandas.

Consistent missing value handling
---------------------------------

Currently, pandas handles missing data differently for different data types. We
use different types to indicate that a value is missing (``np.nan`` for
floating-point data, ``np.nan`` or ``None`` for object-dtype data -- typically
strings or booleans -- with missing values, and ``pd.NaT`` for datetimelike
data). Integer data cannot store missing values, so integer data containing them
is cast to float. In addition,
pandas 1.0 introduced a new missing value sentinel, ``pd.NA``, which is being
used for the experimental nullable integer, boolean, and string data types.

These different missing values have different behaviors in user-facing
operations. Specifically, we introduced different semantics for the nullable
data types for certain operations (e.g. propagating in comparison operations
instead of comparing as False).
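
The difference in semantics can be seen in a small example (a sketch, assuming
pandas 1.0 or later so that ``pd.NA`` and the nullable ``Int64`` dtype exist; the
variable names are illustrative):

.. code-block:: python

    import numpy as np
    import pandas as pd

    # Integers with a missing value are cast to float64 by default.
    default_ints = pd.Series([1, None])

    # NaN-based data: comparing against a missing value gives False.
    float_cmp = pd.Series([1.0, np.nan]) == 1.0

    # Nullable dtype: the missing value propagates through the comparison.
    nullable_cmp = pd.Series([1, None], dtype="Int64") == 1
    print(default_ints.dtype, float_cmp.tolist(), nullable_cmp.tolist())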

Long term, we want to introduce consistent missing data handling for all data
types. This includes consistent behavior in all operations (indexing, arithmetic
operations, comparisons, etc.). There has been discussion of eventually making
the new semantics the default.

This has been discussed at :issue:`28095` (and
linked issues), and described in more detail in this
`design doc <https://hackmd.io/@jorisvandenbossche/Sk0wMeAmB>`__.

Apache Arrow interoperability
-----------------------------

`Apache Arrow <https://arrow.apache.org>`__ is a cross-language development
platform for in-memory data. The Arrow logical types are closely aligned with
typical pandas use cases.

We'd like to provide better-integrated support for Arrow memory and data types
within pandas. This will let us take advantage of its I/O capabilities and
provide for better interoperability with other languages and libraries
using Arrow.

Block manager rewrite
---------------------

We'd like to replace pandas' current internal data structures (a collection of
1-D or 2-D arrays) with a simpler collection of 1-D arrays.

pandas' internal data model is quite complex. A DataFrame is made up of
one or more 2-dimensional "blocks", with one or more blocks per dtype. This
collection of 2-D arrays is managed by the BlockManager.

The primary benefit of the BlockManager is improved performance on certain
operations (construction from a 2D array, binary operations, reductions across the columns),
especially for wide DataFrames. However, the BlockManager substantially increases the
complexity and maintenance burden of pandas.
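
The two layouts can be contrasted with a deliberately simplified sketch. This is
not the actual BlockManager code: the names ``blocks``, ``placement``,
``get_column_blocked`` and ``arrays`` are hypothetical and only illustrate the idea
of per-dtype 2-D blocks versus one 1-D array per column.

.. code-block:: python

    import numpy as np

    columns = {"a": [1, 2, 3], "b": [4, 5, 6], "c": [0.1, 0.2, 0.3]}

    # Block-style layout: one 2-D array per dtype; each row of a block
    # holds the values of one column.
    blocks = {
        "int64": np.array([columns["a"], columns["b"]], dtype="int64"),
        "float64": np.array([columns["c"]], dtype="float64"),
    }
    # Bookkeeping needed to find a column: (block key, row within the block).
    placement = {"a": ("int64", 0), "b": ("int64", 1), "c": ("float64", 0)}


    def get_column_blocked(name):
        """Look up a column through the extra layer of block bookkeeping."""
        key, row = placement[name]
        return blocks[key][row]


    # Column-style layout: one 1-D array per column, no bookkeeping needed.
    arrays = {name: np.asarray(values) for name, values in columns.items()}

    # A reduction across same-dtype columns is a single call on a block.
    row_sums = blocks["int64"].sum(axis=0)

The block layout makes the cross-column reduction cheap, while the column layout
keeps lookups and storage trivial, which is the trade-off described above.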

By replacing the BlockManager we hope to achieve:

* Substantially simpler code
* Easier extensibility with new logical types
* Better user control over memory use and layout
* Improved micro-performance
* Option to provide a C / Cython API to pandas' internals

See `these design documents <https://dev.pandas.io/pandas2/internal-architecture.html#removal-of-blockmanager-new-dataframe-internals>`__
for more.

Decoupling of indexing and internals
------------------------------------

The code for getting and setting values in pandas' data structures needs refactoring.
In particular, we must clearly separate code that converts keys (e.g., the argument
to ``DataFrame.loc``) to positions from code that uses these positions to get
or set values. This is related to the proposed BlockManager rewrite. Currently, the
BlockManager sometimes uses label-based, rather than position-based, indexing.
We propose that it should only work with positional indexing, and the translation of keys
to positions should be entirely done at a higher level.
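
The proposed separation can be sketched in plain Python. The function names
``labels_to_positions``, ``take_positions`` and ``loc_get`` are hypothetical and do
not correspond to pandas internals; plain lists stand in for the index and the
stored values.

.. code-block:: python

    def labels_to_positions(index_labels, keys):
        """Translate labels to integer positions; the only step that sees labels."""
        lookup = {label: pos for pos, label in enumerate(index_labels)}
        # A missing label surfaces as the expected KeyError.
        return [lookup[key] for key in keys]


    def take_positions(values, positions):
        """Positional access only; knows nothing about labels or axes."""
        return [values[pos] for pos in positions]


    def loc_get(index_labels, values, keys):
        """Higher-level getter: convert keys first, then access by position."""
        positions = labels_to_positions(index_labels, keys)
        return take_positions(values, positions)


    print(loc_get(["x", "y", "z"], [10, 20, 30], ["z", "x"]))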

Indexing is a complicated API with many subtleties. This refactor will require care
and attention. The following principles should guide the refactoring of indexing code and
should result in cleaner, simpler, and more performant code.

1. **Label indexing must never involve looking in an axis twice for the same label(s).**
This implies that any validation step must either:

  * limit validation to general features (e.g. dtype/structure of the key/index), or
  * reuse the result for the actual indexing.

2. **Indexers must never rely on an explicit call to other indexers.**
For instance, it is OK to have some internal method of ``.loc`` call some
internal method of ``__getitem__`` (or of their common base class),
but never in the code flow of ``.loc`` should ``the_obj[something]`` appear.

3. **Execution of positional indexing must never involve labels** (as currently, sadly, happens).
That is, the code flow of a getter call (or a setter call in which the right hand side is non-indexed)
to ``.iloc`` should never involve the axes of the object in any way.

4. **Indexing must never involve accessing/modifying values** (i.e., act on ``._data`` or ``.values``) **more than once.**
The following steps must hence be clearly decoupled:

  * find positions we need to access/modify on each axis
  * (if we are accessing) derive the type of object we need to return (dimensionality)
  * actually access/modify the values
  * (if we are accessing) construct the return object

5. As a corollary to the decoupling between 4.i and 4.iii, **any code which deals with how data is stored**
(including any combination of handling multiple dtypes, and sparse storage, categoricals, third-party types)
**must be independent from code that deals with identifying affected rows/columns**,
and take place only once step 4.i is completed.

  * In particular, such code should most probably not live in ``pandas/core/indexing.py``
  * ... and must not depend in any way on the type(s) of axes (e.g. no ``MultiIndex`` special cases)

6. As a corollary to point 1.i, ``Index`` **(sub)classes must provide separate methods for any desired validity check of label(s) which does not involve actual lookup**,
on the one side, and for any required conversion/adaptation/lookup of label(s), on the other.

7. **Use of trial and error should be limited**, and anyway restricted to catch only exceptions
which are actually expected (typically ``KeyError``).

  * In particular, code should never (intentionally) raise new exceptions in the ``except`` portion of a ``try ... except`` block.

8. **Any code portion which is not specific to setters and getters must be shared**,
and when small differences in behavior are expected (e.g. getting with ``.loc`` raises for
missing labels, setting still doesn't), they can be managed with a specific parameter.
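
Principle 8 could look roughly like the following sketch, in which the getter and
the setter share one lookup routine and differ only in a parameter. All names
(``resolve_positions``, ``allow_missing``, ``get_labels``, ``set_labels``) are
hypothetical, and plain lists stand in for pandas objects.

.. code-block:: python

    def resolve_positions(labels, keys, *, allow_missing):
        """Map keys to positions; shared by the getter and the setter."""
        lookup = {label: pos for pos, label in enumerate(labels)}
        positions = []
        for key in keys:
            if key in lookup:
                positions.append(lookup[key])
            elif allow_missing:
                positions.append(None)  # the setter will create this label
            else:
                raise KeyError(key)
        return positions


    def get_labels(labels, values, keys):
        """Getter: missing labels raise."""
        positions = resolve_positions(labels, keys, allow_missing=False)
        return [values[pos] for pos in positions]


    def set_labels(labels, values, keys, new_values):
        """Setter: missing labels are appended instead of raising."""
        positions = resolve_positions(labels, keys, allow_missing=True)
        for key, pos, new in zip(keys, positions, new_values):
            if pos is None:
                labels.append(key)
                values.append(new)
            else:
                values[pos] = new


    labels, values = ["a", "b"], [1, 2]
    set_labels(labels, values, ["b", "c"], [20, 30])
    print(labels, values)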

Numba-accelerated operations
----------------------------

`Numba <https://numba.pydata.org>`__ is a JIT compiler for Python code. We'd like to provide
ways for users to apply their own Numba-jitted functions where pandas accepts user-defined functions
(for example, :meth:`Series.apply`, :meth:`DataFrame.apply`, :meth:`DataFrame.applymap`,
and in groupby and window contexts). This will improve the performance of
user-defined functions in these operations by staying within compiled code.
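
As one example of the kind of API this refers to, recent pandas versions accept an
``engine`` argument in ``Rolling.apply``. The sketch below assumes such a version;
the function ``mean_of_squares`` is illustrative, and the code falls back to the
default engine when Numba is not installed.

.. code-block:: python

    import numpy as np
    import pandas as pd


    def mean_of_squares(window):
        # With raw=True the function receives a plain NumPy array.
        return np.mean(window ** 2)


    try:
        import numba  # noqa: F401  (optional dependency)

        engine = "numba"
    except ImportError:
        engine = None  # let pandas use its default engine

    s = pd.Series([1.0, 2.0, 3.0, 4.0])
    result = s.rolling(2).apply(mean_of_squares, raw=True, engine=engine)
    print(result.tolist())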

Performance monitoring
----------------------

pandas uses `airspeed velocity <https://asv.readthedocs.io/en/stable/>`__ to
monitor for performance regressions. ASV itself is a fabulous tool, but requires
some additional work to be integrated into an open source project's workflow.
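
For readers unfamiliar with ASV, a benchmark is a class whose ``time_*`` methods are
timed after ``setup`` has run. The following is a minimal sketch using those ASV
conventions; the class ``TimeSum`` and its contents are illustrative and not taken
from the pandas benchmark suite.

.. code-block:: python

    import numpy as np


    class TimeSum:
        # ASV runs the benchmark once per parameter value.
        params = [1_000, 100_000]
        param_names = ["n"]

        def setup(self, n):
            # Data preparation is excluded from the timing.
            self.data = np.arange(n, dtype="float64")

        def time_sum(self, n):
            # Only the body of time_* methods is timed.
            self.data.sum()


    # Outside ASV, the benchmark can still be exercised by hand.
    bench = TimeSum()
    bench.setup(1_000)
    bench.time_sum(1_000)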

The `asv-runner <https://github.com/asv-runner>`__ organization, currently made up
of pandas maintainers, provides tools built on top of ASV. We have a physical
machine for running a number of projects' benchmarks, and tools for managing the
benchmark runs and reporting on results.

We'd like to fund improvements and maintenance of these tools to:

* Make them more stable. Currently, they're maintained on the nights and weekends when
  a maintainer has free time.
* Tune the system for benchmarks to improve stability, following
  https://pyperf.readthedocs.io/en/latest/system.html
* Build a GitHub bot to request ASV runs *before* a PR is merged. Currently, the
  benchmarks are only run nightly.

.. _roadmap.evolution:

Roadmap evolution
-----------------

pandas continues to evolve. The direction is primarily determined by community
interest. Everyone is welcome to review existing items on the roadmap and
to propose a new item.

Each item on the roadmap should be a short summary of a larger design proposal.
The proposal should include:

1. Short summary of the changes, which would be appropriate for inclusion in
   the roadmap if accepted.
2. Motivation for the changes.
3. An explanation of why the change is in scope for pandas.
4. Detailed design: Preferably with example usage (even if not implemented yet)
   and API documentation.
5. API Change: Any API changes that may result from the proposal.

That proposal may then be submitted as a GitHub issue, where the pandas maintainers
can review and comment on the design. The `pandas mailing list <https://mail.python.org/mailman/listinfo/pandas-dev>`__
should be notified of the proposal.

When there's agreement that an implementation
would be welcome, the roadmap should be updated to include the summary and a
link to the discussion issue.

Completed items
---------------

This section records now-completed items from the pandas roadmap.

Documentation improvements
~~~~~~~~~~~~~~~~~~~~~~~~~~

We improved the pandas documentation:

* The pandas community worked with others to build the `pydata-sphinx-theme`_,
  which is now used for https://pandas.pydata.org/docs/ (:issue:`15556`).
* :ref:`getting_started` contains a number of resources intended for new
  pandas users coming from a variety of backgrounds (:issue:`26831`).

.. _pydata-sphinx-theme: https://github.com/pydata/pydata-sphinx-theme
