What is the Big O Complexity of Reversing the Order of Columns in Pandas DataFrame?







So let's say I have a DataFrame in pandas with m rows and n columns. Let's also say that I wanted to reverse the order of the columns, which can be done with the following code:


df_reversed = df[df.columns[::-1]]



What is the Big O complexity of this operation? I'm assuming this would depend on the number of columns, but would it also depend on the number of rows?





Note to self: go for the lowest orders of magnitude when setting up such tests :) crashed the laptop.
– roganjosh
Jul 23 at 19:45






If you are going for performance, go with slicing: df.iloc[:,::-1] returns a view and hence should be virtually free, as opposed to df[df.columns[::-1]], which creates a copy because you are indexing in the latter case.
– Divakar
Jul 30 at 13:11



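A quick way to see the difference Divakar describes is to check whether the result shares memory with the original frame. A minimal sketch, assuming a single-dtype frame and one of the pandas versions discussed here (the exact outcome is version-dependent):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list('abc'))

sliced = df.iloc[:, ::-1]          # positional slice
indexed = df[df.columns[::-1]]     # label-list indexing

# On a single-dtype frame the slice typically aliases the original
# buffer while the label-list result is a fresh copy; behaviour can
# differ on newer pandas versions.
print(np.shares_memory(df.values, sliced.values))
print(np.shares_memory(df.values, indexed.values))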





@Divakar, As a general rule, is this true of only iloc, or does loc also return views? Probably beyond the scope of a single comment, but I'm also interested in why direct indexing via df[col_list] should return a copy (is it a design choice / side-effect / is there any benefit?).
– jpp
Jul 30 at 15:11








@Divakar If I return a view, can I still do operations on this and then once again reverse the order of the columns and end up with the original dataframe with the operations applied?
– Tim Holdsworth
Aug 3 at 21:32





@TimHoldsworth Once you do operations, you create a copy.
– Divakar
Aug 3 at 21:43
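In other words, you can flip, operate, and flip back, and you get the original column order with the operation applied, but the result of the operation is a new copy rather than a view. A small sketch illustrating this (the example frame is arbitrary):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list('abc'))

# Flip, apply an operation, flip back: the column order matches the
# original, but the arithmetic produced a brand-new DataFrame.
result = (df.iloc[:, ::-1] * 2).iloc[:, ::-1]
print(result.equals(df * 2))                       # True
print(np.shares_memory(df.values, result.values))  # False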




3 Answers



I don't know how Pandas implements this, but I did test it empirically. I ran the following code (in a Jupyter notebook) to test the speed of the operation:


import pandas as pd

def get_dummy_df(n):
    return pd.DataFrame({'a': [1,2]*n, 'b': [4,5]*n, 'c': [7,8]*n})

df = get_dummy_df(100)
print(df.shape)
%timeit df_r = df[df.columns[::-1]]

df = get_dummy_df(1000)
print(df.shape)
%timeit df_r = df[df.columns[::-1]]

df = get_dummy_df(10000)
print(df.shape)
%timeit df_r = df[df.columns[::-1]]

df = get_dummy_df(100000)
print(df.shape)
%timeit df_r = df[df.columns[::-1]]

df = get_dummy_df(1000000)
print(df.shape)
%timeit df_r = df[df.columns[::-1]]

df = get_dummy_df(10000000)
print(df.shape)
%timeit df_r = df[df.columns[::-1]]



The output was:


(200, 3)
1000 loops, best of 3: 419 µs per loop
(2000, 3)
1000 loops, best of 3: 425 µs per loop
(20000, 3)
1000 loops, best of 3: 498 µs per loop
(200000, 3)
100 loops, best of 3: 2.66 ms per loop
(2000000, 3)
10 loops, best of 3: 25.2 ms per loop
(20000000, 3)
1 loop, best of 3: 207 ms per loop



As you can see, in the first 3 cases the overhead of the operation is what takes most of the time (400-500 µs), but from the 4th case onwards the time starts to be proportional to the amount of data, increasing by an order of magnitude each time.



So, assuming the time is also proportional to the number of columns n, it seems that we are dealing with O(m*n).
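For completeness, a quick way to check the assumed dependence on the number of columns is to hold the row count fixed and sweep the column count instead. A minimal sketch with arbitrary sizes, not part of the original test:

import numpy as np
import pandas as pd
from timeit import timeit

m = 10000  # fixed number of rows
for n_cols in (10, 100, 1000):
    df = pd.DataFrame(np.zeros((m, n_cols)))
    # average time per reversal over 10 runs
    t = timeit(lambda: df[df.columns[::-1]], number=10) / 10
    print(n_cols, t)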



The Big O complexity (as of Pandas 0.24) is m*n, where m is the number of columns and n is the number of rows. Note, this is when using the DataFrame.__getitem__ method (aka []) with an Index (see the relevant code; other key types would also trigger a copy).


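The trace below was produced by dropping a breakpoint (the import ipdb; ipdb.set_trace() line visible in it) into the installed pandas source. If you would rather not edit the package, a similar walk can be done with the standard debugger; a sketch, where the frames and line numbers will vary with your pandas version:

import pdb
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [4, 5], 'c': [7, 8]})

# Step into DataFrame.__getitem__ with 's'; the frames mirror the trace
# below (_getitem_array -> _take -> ... -> algos.take_nd).
pdb.run('df_reversed = df[df.columns[::-1]]')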



Here is a helpful stack trace:


<ipython-input-4-3162cae03863>(2)<module>()
1 columns = df.columns[::-1]
----> 2 df_reversed = df[columns]

pandas/core/frame.py(2682)__getitem__()
2681 # either boolean or fancy integer index
-> 2682 return self._getitem_array(key)
2683 elif isinstance(key, DataFrame):

pandas/core/frame.py(2727)_getitem_array()
2726 indexer = self.loc._convert_to_indexer(key, axis=1)
-> 2727 return self._take(indexer, axis=1)
2728

pandas/core/generic.py(2789)_take()
2788 axis=self._get_block_manager_axis(axis),
-> 2789 verify=True)
2790 result = self._constructor(new_data).__finalize__(self)

pandas/core/internals.py(4539)take()
4538 return self.reindex_indexer(new_axis=new_labels, indexer=indexer,
-> 4539 axis=axis, allow_dups=True)
4540

pandas/core/internals.py(4421)reindex_indexer()
4420 new_blocks = self._slice_take_blocks_ax0(indexer,
-> 4421 fill_tuple=(fill_value,))
4422 else:

pandas/core/internals.py(1254)take_nd()
1253 new_values = algos.take_nd(values, indexer, axis=axis,
-> 1254 allow_fill=False)
1255 else:

> pandas/core/algorithms.py(1658)take_nd()
1657 import ipdb; ipdb.set_trace()
-> 1658 func = _get_take_nd_function(arr.ndim, arr.dtype, out.dtype, axis=axis,
1659 mask_info=mask_info)
1660 func(arr, indexer, out, fill_value)



The func call on L1660 in pandas/core/algorithms ultimately calls a cython function with O(m * n) complexity. This is where data from the original DataFrame is copied into out. out contains a copy of the original data in reversed order.




inner_take_2d_axis0_template = """
    cdef:
        Py_ssize_t i, j, k, n, idx
        %(c_type_out)s fv

    n = len(indexer)
    k = values.shape[1]

    fv = fill_value

    IF %(can_copy)s:
        cdef:
            %(c_type_out)s *v
            %(c_type_out)s *o

        #GH3130
        if (values.strides[1] == out.strides[1] and
                values.strides[1] == sizeof(%(c_type_out)s) and
                sizeof(%(c_type_out)s) * n >= 256):

            for i from 0 <= i < n:
                idx = indexer[i]
                if idx == -1:
                    for j from 0 <= j < k:
                        out[i, j] = fv
                else:
                    v = &values[idx, 0]
                    o = &out[i, 0]
                    memmove(o, v, <size_t>(sizeof(%(c_type_out)s) * k))
            return

    for i from 0 <= i < n:
        idx = indexer[i]
        if idx == -1:
            for j from 0 <= j < k:
                out[i, j] = fv
        else:
            for j from 0 <= j < k:
                out[i, j] = %(preval)svalues[idx, j]%(postval)s
"""



Note that in the above template function there is a path that uses memmove (which is the path taken in this case, because we are mapping from int64 to int64 and the dimensions of the output are identical, as we are just switching the indexes). memmove is still O(n), being proportional to the number of bytes it has to copy, although it is likely faster than writing to the indexes directly.


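As a rough illustration of what the generated loop does (a pure-NumPy sketch, not the pandas code itself): for each position in the indexer it copies one contiguous run of k values, so the total work is proportional to n * k whether the memmove branch or the element-wise branch is taken:

import numpy as np

values = np.arange(12, dtype=np.int64).reshape(4, 3)
indexer = np.array([3, 2, 1, 0])   # reversed order along axis 0
out = np.empty_like(values)

for i, idx in enumerate(indexer):
    # one contiguous block copy per index, analogous to the memmove branch
    out[i, :] = values[idx, :]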





I went ahead and added some more examples which show more context on the path taken after calling __getitem__ and the cython function called, which ultimately is where most of the time is spent for large inputs. The logic in __getitem__ is not always intuitive, but I found this GH issue helpful for explaining what different inputs do under the hood: github.com/pandas-dev/pandas/issues/9595.
– akosel
Aug 6 at 14:22








Thank you, this goes a long way to explaining the behaviour we see.
– jpp
Aug 7 at 9:48




I ran an empirical test using the big_O fitting library here.





Note: all tests have been conducted while sweeping the independent variable over 6 orders of magnitude (i.e. rows from 10 to 10^6 against a constant column size of 3, and columns from 10 to 10^6 against a constant row size of 10).

The result shows that the complexity of the columns reverse operation .columns[::-1] on the DataFrame is O(n^3) vs. rows and O(n^3) vs. columns.


Prerequisites: you will need to install big_o using the terminal command pip install big_o.



Code


import big_o
import pandas as pd
import numpy as np

SWEAP_LOG10 = 6
COLUMNS = 3
ROWS = 10

def build_df(rows, columns):
    # To isolate the creation of the DataFrame from the inversion operation.
    narray = np.zeros(rows*columns).reshape(rows, columns)
    df = pd.DataFrame(narray)
    return df

def flip_columns(df):
    return df[df.columns[::-1]]

def get_row_df(n, m=COLUMNS):
    return build_df(1*10**n, m)

def get_column_df(n, m=ROWS):
    return build_df(m, 1*10**n)


# infer the big_o on columns[::-1] operation vs. rows
best, others = big_o.big_o(flip_columns, get_row_df, min_n=1, max_n=SWEAP_LOG10,
                           n_measures=SWEAP_LOG10, n_repeats=10)

# print results
print('Measuring .columns[::-1] complexity against rapid increase in # rows')
print('-'*80 + '\nBig O() fits: {}\n'.format(best) + '-'*80)

for class_, residual in others.items():
    print('{:<60s} (res: {:.2G})'.format(str(class_), residual))

print('-'*80)

# infer the big_o on columns[::-1] operation vs. columns
best, others = big_o.big_o(flip_columns, get_column_df, min_n=1, max_n=SWEAP_LOG10,
                           n_measures=SWEAP_LOG10, n_repeats=10)

# print results
print()
print('Measuring .columns[::-1] complexity against rapid increase in # columns')
print('-'*80 + '\nBig O() fits: {}\n'.format(best) + '-'*80)

for class_, residual in others.items():
    print('{:<60s} (res: {:.2G})'.format(str(class_), residual))

print('-'*80)



Results


Measuring .columns[::-1] complexity against rapid increase in # rows
--------------------------------------------------------------------------------
Big O() fits: Cubic: time = -0.017 + 0.00067*n^3
--------------------------------------------------------------------------------
Constant: time = 0.032 (res: 0.021)
Linear: time = -0.051 + 0.024*n (res: 0.011)
Quadratic: time = -0.026 + 0.0038*n^2 (res: 0.0077)
Cubic: time = -0.017 + 0.00067*n^3 (res: 0.0052)
Polynomial: time = -6.3 * x^1.5 (res: 6)
Logarithmic: time = -0.026 + 0.053*log(n) (res: 0.015)
Linearithmic: time = -0.024 + 0.012*n*log(n) (res: 0.0094)
Exponential: time = -7 * 0.66^n (res: 3.6)
--------------------------------------------------------------------------------


Measuring .columns[::-1] complexity against rapid increase in # columns
--------------------------------------------------------------------------------
Big O() fits: Cubic: time = -0.28 + 0.009*n^3
--------------------------------------------------------------------------------
Constant: time = 0.38 (res: 3.9)
Linear: time = -0.73 + 0.32*n (res: 2.1)
Quadratic: time = -0.4 + 0.052*n^2 (res: 1.5)
Cubic: time = -0.28 + 0.009*n^3 (res: 1.1)
Polynomial: time = -6 * x^2.2 (res: 16)
Logarithmic: time = -0.39 + 0.71*log(n) (res: 2.8)
Linearithmic: time = -0.38 + 0.16*n*log(n) (res: 1.8)
Exponential: time = -7 * 1^n (res: 9.7)
--------------------------------------------------------------------------------





You're neglecting overhead at small orders of magnitude
– Mad Physicist
Aug 4 at 19:12





And yes, technically if it's O(n), O(n^3) applies too, but it's not very useful.
– Mad Physicist
Aug 4 at 19:13





@MadPhysicist Can you elaborate on the overhead at small orders of magnitude?
– helcode
Aug 4 at 19:15





The other answer does a great job of that.
– Mad Physicist
Aug 4 at 19:17





You're just doing a fit to a limited number of points, and the low tail is throwing you off. The other answer is actually looking at the big O. I seriously doubt that any basic operation in pandas is O(n^3).
– Mad Physicist
Aug 4 at 19:20






