What is the Big O Complexity of Reversing the Order of Columns in Pandas DataFrame?
So let's say I have a DataFrame in pandas with m rows and n columns. Let's also say that I wanted to reverse the order of the columns, which can be done with the following code:
df_reversed = df[df.columns[::-1]]
What is the Big O complexity of this operation? I'm assuming this would depend on the number of columns, but would it also depend on the number of rows?
If you are going for performance, go with slicing: df.iloc[:,::-1] returns a view and hence should be virtually free, as opposed to df[df.columns[::-1]], which creates a copy since you are indexing in the latter one. – Divakar
Jul 30 at 13:11
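A quick way to check the view-versus-copy claim is to test whether the result shares memory with the original data. This is a minimal sketch with made-up data; exact behaviour can depend on pandas version and dtype layout.

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

sliced = df.iloc[:, ::-1]        # positional slice; typically a view
indexed = df[df.columns[::-1]]   # label-based fancy indexing; a copy

# True when the slice aliases the original buffer, False for the copy
print(np.shares_memory(df['a'].to_numpy(), sliced['a'].to_numpy()))
print(np.shares_memory(df['a'].to_numpy(), indexed['a'].to_numpy()))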
@Divakar, As a general rule, is this true only of iloc, or does loc also return views? Probably beyond the scope of a single comment, but I'm also interested in why direct indexing via df[col_list] should return a copy (is it a design choice / a side effect / is there any benefit?). – jpp
Jul 30 at 15:11
@divakar if I return a view, can I still do operations on this and then once again reverse the order of the columns and end up with the original dataframe with the operations applied?
– Tim Holdsworth
Aug 3 at 21:32
@TimHoldsworth Once you do operations, you create a copy.
– Divakar
Aug 3 at 21:43
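Following up on that exchange, here is a minimal sketch (hypothetical data, not from the original posts) of the round trip asked about above: an operation on the reversed view allocates a new frame, and reversing again yields the original column order with the operation applied.

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

reversed_view = df.iloc[:, ::-1]   # columns c, b, a; typically a view
doubled = reversed_view * 2        # any operation allocates a new frame (a copy)
restored = doubled.iloc[:, ::-1]   # back to columns a, b, c, with the operation applied

print(restored)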
3 Answers
I don't know how Pandas implements this, but I did test it empirically. I ran the following code (in a Jupyter notebook) to test the speed of the operation:
import pandas as pd

def get_dummy_df(n):
    return pd.DataFrame({'a': [1, 2]*n, 'b': [4, 5]*n, 'c': [7, 8]*n})

df = get_dummy_df(100)
print(df.shape)
%timeit df_r = df[df.columns[::-1]]

df = get_dummy_df(1000)
print(df.shape)
%timeit df_r = df[df.columns[::-1]]

df = get_dummy_df(10000)
print(df.shape)
%timeit df_r = df[df.columns[::-1]]

df = get_dummy_df(100000)
print(df.shape)
%timeit df_r = df[df.columns[::-1]]

df = get_dummy_df(1000000)
print(df.shape)
%timeit df_r = df[df.columns[::-1]]

df = get_dummy_df(10000000)
print(df.shape)
%timeit df_r = df[df.columns[::-1]]
The output was:
(200, 3)
1000 loops, best of 3: 419 µs per loop
(2000, 3)
1000 loops, best of 3: 425 µs per loop
(20000, 3)
1000 loops, best of 3: 498 µs per loop
(200000, 3)
100 loops, best of 3: 2.66 ms per loop
(2000000, 3)
10 loops, best of 3: 25.2 ms per loop
(20000000, 3)
1 loop, best of 3: 207 ms per loop
As you can see, in the first three cases, the overhead of the operation is what takes most of the time (400-500 µs), but from the fourth case onwards, the time starts to be proportional to the amount of data, increasing by an order of magnitude each time.
So, assuming the time is also proportional to the number of columns n, it seems that we are dealing with O(m*n).
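To check the dependence on the number of columns as well, here is a minimal sketch (my own variation, not part of the original test) that holds the row count fixed and varies the column count:

import numpy as np
import pandas as pd
from timeit import timeit

for n_cols in (10, 100, 1000):
    df = pd.DataFrame(np.zeros((10000, n_cols)))
    # average time of 100 reversals for this shape
    t = timeit(lambda: df[df.columns[::-1]], number=100)
    print(df.shape, t / 100)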
The Big O complexity (as of Pandas 0.24) is m*n, where m is the number of columns and n is the number of rows. Note, this is when using the DataFrame.__getitem__ method (i.e. the [] operator) with an Index (see the relevant code, with other types that would trigger a copy).
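As a quick sanity check that this path really produces an independent copy (a minimal sketch; copy/view semantics can differ across pandas versions):

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df_reversed = df[df.columns[::-1]]

# Mutating the original does not show up in the reversed frame,
# consistent with __getitem__ having copied the data.
df.loc[0, 'a'] = 99
print(df_reversed.loc[0, 'a'])  # still 1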
Here is a helpful stack trace:
<ipython-input-4-3162cae03863>(2)<module>()
1 columns = df.columns[::-1]
----> 2 df_reversed = df[columns]
pandas/core/frame.py(2682)__getitem__()
2681 # either boolean or fancy integer index
-> 2682 return self._getitem_array(key)
2683 elif isinstance(key, DataFrame):
pandas/core/frame.py(2727)_getitem_array()
2726 indexer = self.loc._convert_to_indexer(key, axis=1)
-> 2727 return self._take(indexer, axis=1)
2728
pandas/core/generic.py(2789)_take()
2788 axis=self._get_block_manager_axis(axis),
-> 2789 verify=True)
2790 result = self._constructor(new_data).__finalize__(self)
pandas/core/internals.py(4539)take()
4538 return self.reindex_indexer(new_axis=new_labels, indexer=indexer,
-> 4539 axis=axis, allow_dups=True)
4540
pandas/core/internals.py(4421)reindex_indexer()
4420 new_blocks = self._slice_take_blocks_ax0(indexer,
-> 4421 fill_tuple=(fill_value,))
4422 else:
pandas/core/internals.py(1254)take_nd()
1253 new_values = algos.take_nd(values, indexer, axis=axis,
-> 1254 allow_fill=False)
1255 else:
> pandas/core/algorithms.py(1658)take_nd()
1657 import ipdb; ipdb.set_trace()
-> 1658 func = _get_take_nd_function(arr.ndim, arr.dtype, out.dtype, axis=axis,
1659 mask_info=mask_info)
1660 func(arr, indexer, out, fill_value)
The func call on L1660 in pandas/core/algorithms ultimately calls a Cython function with O(m * n) complexity. This is where data from the original is copied into out, which therefore contains a copy of the original data in reversed order.
inner_take_2d_axis0_template = """
    cdef:
        Py_ssize_t i, j, k, n, idx
        %(c_type_out)s fv

    n = len(indexer)
    k = values.shape[1]

    fv = fill_value

    IF %(can_copy)s:
        cdef:
            %(c_type_out)s *v
            %(c_type_out)s *o

        #GH3130
        if (values.strides[1] == out.strides[1] and
                values.strides[1] == sizeof(%(c_type_out)s) and
                sizeof(%(c_type_out)s) * n >= 256):

            for i from 0 <= i < n:
                idx = indexer[i]
                if idx == -1:
                    for j from 0 <= j < k:
                        out[i, j] = fv
                else:
                    v = &values[idx, 0]
                    o = &out[i, 0]
                    memmove(o, v, <size_t>(sizeof(%(c_type_out)s) * k))
            return

    for i from 0 <= i < n:
        idx = indexer[i]
        if idx == -1:
            for j from 0 <= j < k:
                out[i, j] = fv
        else:
            for j from 0 <= j < k:
                out[i, j] = %(preval)svalues[idx, j]%(postval)s
"""
Note that in the above template function there is a path that uses memmove (which is the path taken in this case, because we are mapping from int64 to int64 and the dimensions of the output are identical, as we are just switching the indexes). Note that memmove is still O(n), being proportional to the number of bytes it has to copy, although it is likely faster than writing to the indexes directly.
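To make the cost concrete, the take performed here is conceptually equivalent to the following numpy sketch (not the pandas internals): every element is visited and copied exactly once, hence O(m * n).

import numpy as np

values = np.arange(12).reshape(4, 3)        # m=4 rows, n=3 columns
indexer = np.arange(values.shape[1])[::-1]  # reversed column positions

out = np.empty_like(values)                 # newly allocated output buffer
for i, idx in enumerate(indexer):
    out[:, i] = values[:, idx]              # copy one column at a time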
I went ahead and added some more examples which give more context on the path taken after calling __getitem__ and the Cython function called, which ultimately is where most of the time is spent for large inputs. The logic in __getitem__ is not always intuitive, but I found this GH issue helpful for explaining what different inputs do under the hood: github.com/pandas-dev/pandas/issues/9595. – akosel
Aug 6 at 14:22
Thank you, this goes a long way to explaining the behaviour we see.
– jpp
Aug 7 at 9:48
I ran an empirical test using the big_O fitting library here.
Note: All tests have been conducted on an independent variable sweeping 6 orders of magnitude (i.e. rows from 10 to 10^6 against a fixed column size of 3, and columns from 10 to 10^6 against a fixed row size of 10).
The result shows that the complexity of the columns reverse operation .columns[::-1] on the DataFrame is O(n^3) in rows and O(n^3) in columns.
Prerequisites: You will need to install big_o, using the terminal command pip install big_o.
Code
import big_o
import pandas as pd
import numpy as np

SWEAP_LOG10 = 6
COLUMNS = 3
ROWS = 10

def build_df(rows, columns):
    # To isolate the creation of the DataFrame from the inversion operation.
    narray = np.zeros(rows*columns).reshape(rows, columns)
    df = pd.DataFrame(narray)
    return df

def flip_columns(df):
    return df[df.columns[::-1]]

def get_row_df(n, m=COLUMNS):
    return build_df(1*10**n, m)

def get_column_df(n, m=ROWS):
    return build_df(m, 1*10**n)

# infer the big_o on columns[::-1] operation vs. rows
best, others = big_o.big_o(flip_columns, get_row_df, min_n=1, max_n=SWEAP_LOG10,
                           n_measures=SWEAP_LOG10, n_repeats=10)

# print results
print('Measuring .columns[::-1] complexity against rapid increase in # rows')
print('-'*80 + '\nBig O() fits: {}\n'.format(best) + '-'*80)

for class_, residual in others.items():
    print('{:<60s}  (res: {:.2G})'.format(str(class_), residual))
print('-'*80)

# infer the big_o on columns[::-1] operation vs. columns
best, others = big_o.big_o(flip_columns, get_column_df, min_n=1, max_n=SWEAP_LOG10,
                           n_measures=SWEAP_LOG10, n_repeats=10)

# print results
print()
print('Measuring .columns[::-1] complexity against rapid increase in # columns')
print('-'*80 + '\nBig O() fits: {}\n'.format(best) + '-'*80)

for class_, residual in others.items():
    print('{:<60s}  (res: {:.2G})'.format(str(class_), residual))
print('-'*80)
Results
Measuring .columns[::-1] complexity against rapid increase in # rows
--------------------------------------------------------------------------------
Big O() fits: Cubic: time = -0.017 + 0.00067*n^3
--------------------------------------------------------------------------------
Constant: time = 0.032 (res: 0.021)
Linear: time = -0.051 + 0.024*n (res: 0.011)
Quadratic: time = -0.026 + 0.0038*n^2 (res: 0.0077)
Cubic: time = -0.017 + 0.00067*n^3 (res: 0.0052)
Polynomial: time = -6.3 * x^1.5 (res: 6)
Logarithmic: time = -0.026 + 0.053*log(n) (res: 0.015)
Linearithmic: time = -0.024 + 0.012*n*log(n) (res: 0.0094)
Exponential: time = -7 * 0.66^n (res: 3.6)
--------------------------------------------------------------------------------
Measuring .columns[::-1] complexity against rapid increase in # columns
--------------------------------------------------------------------------------
Big O() fits: Cubic: time = -0.28 + 0.009*n^3
--------------------------------------------------------------------------------
Constant: time = 0.38 (res: 3.9)
Linear: time = -0.73 + 0.32*n (res: 2.1)
Quadratic: time = -0.4 + 0.052*n^2 (res: 1.5)
Cubic: time = -0.28 + 0.009*n^3 (res: 1.1)
Polynomial: time = -6 * x^2.2 (res: 16)
Logarithmic: time = -0.39 + 0.71*log(n) (res: 2.8)
Linearithmic: time = -0.38 + 0.16*n*log(n) (res: 1.8)
Exponential: time = -7 * 1^n (res: 9.7)
--------------------------------------------------------------------------------
You're neglecting overhead at small orders of magnitude
– Mad Physicist
Aug 4 at 19:12
And yes, technically if it's O(n), O(n^3) applies too, but it's not very useful.
– Mad Physicist
Aug 4 at 19:13
@MadPhysicist can you elaborate on overhead at small orders of magnitude?
– helcode
Aug 4 at 19:15
The other answer does a great job of that.
– Mad Physicist
Aug 4 at 19:17
You're just doing a fit to a limited number of points, and the low tail is throwing you off. The other answer is actually looking at the big O. I seriously doubt that any basic operation in pandas is O(n^3).
– Mad Physicist
Aug 4 at 19:20
Note to self: go for the lowest orders of magnitude when setting up such tests :) crashed the laptop.
– roganjosh
Jul 23 at 19:45