Optimising Poor openpyxl Performance

Problem:

openpyxl is a Python library for reading and writing Excel (xlsx) files. It's fantastic and works really well. Plenty of features are supported, but my primary use-case is simply to read data from various sheets and convert the contents into a dictionary / list of dictionaries.

The problem is, if you have a massive spreadsheet, this can take AGES. Usually this is down to the code you're using to read the sheet. It's incredible what a difference a few optimisations can make.

Example:

I have a spreadsheet with 20,355 entries and 4 columns. We want to turn this into a list of dictionaries, where each element of the list corresponds to a row and the keys of each dictionary are the column headings. This example is from a real problem I had to deal with (but have simplified for the scope of this test).

The column headings are taken from the first row of the sheet. This ensures that columns can be rearranged or added to without hardcoding a particular column to a particular dataset within the code (which is horrible, please please never do that!)
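
To make the target structure concrete, here's a tiny sketch of what we're aiming for (the headings and values below are made up purely for illustration; the real sheet has four columns and 20,355 data rows):

# A sheet like this (first row holds the headings)...
#
#   Name    | Site   | Port | Status
#   core-01 | London | 22   | up
#   core-02 | Paris  | 443  | down
#
# ...should come out as one dictionary per data row, keyed by heading:
sheet_list = [
    {'Name': 'core-01', 'Site': 'London', 'Port': 22, 'Status': 'up'},
    {'Name': 'core-02', 'Site': 'Paris', 'Port': 443, 'Status': 'down'},
]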

Firstly, I want to highlight that some optimisations have already been made. Specifically, the first is to use read-only mode when loading the workbook. This can be done by specifying read_only=True in the load_workbook function, like so:

workbook = openpyxl.load_workbook(EXCEL_FILE, read_only=True)
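
As a side note, a workbook opened with read_only=True keeps the underlying file handle open while you read from it, so it's worth closing it once you're done. A minimal sketch (EXCEL_FILE being the path to your workbook):

import openpyxl

workbook = openpyxl.load_workbook(EXCEL_FILE, read_only=True)
try:
    sheet = workbook['Sheet1']
    # ... read the data here ...
finally:
    workbook.close()  # read-only workbooks should be closed explicitly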

Here's the first function used:

import time
import openpyxl

# EXCEL_FILE is the path to the workbook (defined elsewhere)

def slow_openpyxl():
    start_time = time.time()

    workbook = openpyxl.load_workbook(EXCEL_FILE, data_only=True, read_only=True)
    sheet = workbook['Sheet1']

    # Reported sheet dimensions (these can be larger than the actual data)
    max_cols = sheet.max_column + 1
    max_rows = sheet.max_row + 1

    # First pass: build the heading list from row 1
    header = []
    for col_num in range(1, max_cols):
        if sheet.cell(row=1, column=col_num).value is None:
            break
        else:
            header.append(sheet.cell(row=1, column=col_num).value)

    # Second pass: build one dictionary per data row
    sheet_list = []
    for row_num in range(2, max_rows):
        new_entry = {}
        for col_num in range(1, max_cols):
            if sheet.cell(row=row_num, column=col_num).value is None:
                new_entry[header[col_num-1]] = ""
            else:
                new_entry[header[col_num-1]] = sheet.cell(row=row_num, column=col_num).value

        print(f'Row {row_num}/{max_rows} Time elapsed: {round(time.time()-start_time,2)}')
        sheet_list.append(new_entry)
    print('Done!')
    return sheet_list

There are a few bad things about this code, and I'll cover some:

1) The heading list is built separately - this means you need another loop and all it does is fetch the headings. This isn't that bad as it's only one row.

2) We're using the max_row and max_column properties of the sheet. This is bad because Excel sheets can report dimensions larger than the data they actually contain, so you end up iterating through lots of empty cells.

3) The really bad part is the use of sheet.cell(row=x, column=y).value. This is exceedingly slow. (As far as I can tell, this is because in read-only mode the sheet is streamed from disk rather than held in memory, so random access like this forces openpyxl to re-read the worksheet to find each cell, which also explains why the per-row time gets worse and worse as we go.)

4) What makes #3 even worse is that we're calling it multiple times per cell! Once when checking if the value is None, and again when actually assigning the value to the dictionary. If you're using resource-hungry calls, store the result in a temporary variable instead of calling again (sketched just below).
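
To illustrate points 3 and 4 in isolation, here's a rough sketch of the cheap fix applied inside the slow function above: call the expensive lookup once, store the result, then branch on the variable.

# Before: two expensive lookups for every cell
if sheet.cell(row=row_num, column=col_num).value is None:
    new_entry[header[col_num-1]] = ""
else:
    new_entry[header[col_num-1]] = sheet.cell(row=row_num, column=col_num).value

# After: one lookup, stored in a variable and reused
value = sheet.cell(row=row_num, column=col_num).value
new_entry[header[col_num-1]] = "" if value is None else value

This alone doesn't fix the real problem (the random access itself), but it does halve the number of expensive calls.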

Running this, the benchmark figures spat out the following before I aborted the program:

<trimmed>
Row 367/20357 Time elapsed: 11.19
Row 368/20357 Time elapsed: 11.25
Row 369/20357 Time elapsed: 11.29
<aborted>

So we've done 369 rows in 11.29 seconds. That's 32.68 rows per second. Sounds fast, but not when you have more than 20,000 of them. Unfortunately the processing time isn't linear either. I left it running for a bit longer...

<trimmed>
Row 953/20357 Time elapsed: 60.52
Row 954/20357 Time elapsed: 60.63
Row 955/20357 Time elapsed: 60.74
<aborted>

After 60 seconds we've only done 955 rows. That's now 15.72 rows per second. We're not even 5% of the way through the file and it's slowed down to less than half the speed.

I'm not going to let it run forever; believe it or not, I have better things to do with my life. But clearly there is a rather catastrophic performance problem here, which would result in the full 20,000+ rows taking potentially hours to load. (I'm too lazy to collect more data and too wise to draw a graph from just these two points.)

https://xkcd.com/605/

Solution:

Fortunately, after some (rather basic) research, this problem is easily solvable! The key is to use iterators built into openpyxl.

Consider this new function:

def fast_openpyxl():
    start_time = time.time()

    workbook = openpyxl.load_workbook(EXCEL_FILE, data_only=True, read_only=True)
    sheet = workbook['Sheet1']

    header = []
    sheet_list = []
    # Single pass over the sheet using the built-in row iterator
    for rowidx, row in enumerate(sheet.rows):
        new_entry = {}
        for colidx, cell in enumerate(row):
            # Read the cell value once and reuse the variable
            value = cell.value
            if rowidx == 0:
                # First row: build the heading list, stopping at the first blank heading
                if value is None:
                    break
                header.append(value)
            else:
                # Data rows: stop once we run past the headings
                if value is None:
                    try:
                        new_entry[header[colidx]] = ""
                    except IndexError:
                        break
                else:
                    try:
                        new_entry[header[colidx]] = value
                    except IndexError:
                        break

        print(f'Row {rowidx+1}/{sheet.max_row+1} Time elapsed: {round(time.time()-start_time,2)}')
        if rowidx != 0:
            sheet_list.append(new_entry)

    print(f'Done! Completed in {round(time.time()-start_time,2)} seconds.')
    return sheet_list

So, what are we doing differently?

We've now got a single nested loop which iterates over rows and then columns, using the built-in sheet.rows iterator to walk the sheet and then iterating over the cells in each row.

The heading list is built by checking whether rowidx is 0, removing the need for a separate loop just to fetch the headings.

We're also using exceptions to catch index errors when inserting data into each dictionary. This works nicely: indexing the heading list throws an IndexError as soon as a cell falls outside the headings, and we then stop reading data for that row.

Finally, we're assigning the value of a cell to a variable, then using that variable for the logic. As with most languages, reading a variable is far cheaper than repeating a function (or property) call.
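
As an aside, recent versions of openpyxl (2.6 and newer, I believe) also provide sheet.iter_rows(values_only=True), which yields plain value tuples rather than Cell objects. Zipping each row against the header then sidesteps the IndexError handling entirely. A rough sketch along the same lines (I haven't benchmarked this variant):

from itertools import takewhile

rows = sheet.iter_rows(values_only=True)  # tuples of plain values, no Cell objects
# Build the header from the first row, stopping at the first blank heading
header = list(takewhile(lambda h: h is not None, next(rows)))

sheet_list = []
for row in rows:
    # zip() stops at the shorter sequence, so cells beyond the headings are ignored
    sheet_list.append({key: ("" if value is None else value)
                       for key, value in zip(header, row)})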

Now for what you've been waiting for, juicy juicy numbers!

<trimmed>
Row 20354/20357 Time elapsed: 2.07
Row 20355/20357 Time elapsed: 2.07
Row 20356/20357 Time elapsed: 2.07
Done! Completed in 2.07 seconds.
dchidell@dchidell-mac:Desktop$ 

That's a STAGGERING speed boost! We've cut the ENTIRE execution time down to just over two seconds. That's everything; I didn't have to abort it!

Conclusion:

1) Use read_only when opening the file if all you're doing is reading.

2) Use the built in iterators!

I cannot stress this enough - the iterators are fast, crazy fast.

3) Call functions as infrequently as possible and store intermediate results in variables. It may bulk the code up a bit, but it tends to be more efficient and also makes your code more readable (though this is icing on the cake compared to points 1 and 2). Python can also blur the line between a variable and a function call (cell.value looks like a plain attribute but is actually a property), so as a general rule, if you would otherwise call something repeatedly, store the result in an intermediate variable.

Bonus points if you saw my off-by-one error in the timing code relating to the number of rows :)