How to stop reading a spreadsheet with xlrd at the first empty row?

auslander picture auslander · Mar 16, 2017 · Viewed 21.2k times · Source

I'm using xlrd to sniff through a directory structure and pull out spreadsheets, reading the second row (row 1) down to "do stuff." The issue is that I don't know how to stop the reading/printing at the first empty row. I understand that rows are not "empty" objects, but I'd appreciate a little help showing how to check if all cells are empty. Here is the code I'm working with:

import xlrd
import os

def excel_file_filter(filename, extensions=['.xls', '.xlsx']):
    return any(filename.endswith(e) for e in extensions)

def get_filenames(root):
    filename_list = []
    for path, subdirs, files in os.walk(root):
        for filename in filter(excel_file_filter, files):
            filename_list.append(os.path.join(path, filename))
    return filename_list

spreadsheets = get_filenames('C:\\Temp')
for s in spreadsheets:
    with xlrd.open_workbook(s) as wb:
        cs = wb.sheet_by_index(0)
        num_cols = cs.ncols
        for row_index in range(1, cs.nrows):
            print('Row: {}'.format(row_index))
            for col_index in range(0, num_cols):
                cell_object = cs.cell(row_index, col_index)
                if cell_obj is not xlrd.empty_cell:
                    print('Col #: {} | Value: {}'.format(col_index, cell_obj))

What ends up happening is that it prints all the way through almost 1000 rows, when only the first say, 25 rows have content in them. The amount of content between the spreadsheets varies, so a generic solution (that does not rely on other optional libraries) that helps me understand how to detect an empty row and then break, would be appreciated.

Answer

user7609283 picture user7609283 · Mar 20, 2017

First: To get the cell value then check if it is empty, use one of the methods explained in the answer of question How to detect if a cell is empty when reading Excel files using the xlrd library?

  1. When using cell_val= cs.cell(row_index, col_index).value to get the value:
    • To check if it is empty: simply write if cell_vel == ''
  2. When using cell_object = cs.cell(row_index, col_index) to get the value:
    • To check if it is empty:
      -first get the cell_type cell_type = cs.cell_type(row_index, col_index)
      -then check if cell_type == xlrd.XL_CELL_EMPTY

Second: To check if the whole row is empty you can do the following:

  1. Define a counter (count_empty=0) to count number empty cells in the row & boolean (empty_cell = False)
  2. Check if the cell is empty
    If it is > increment counter & change empty_cell to True
    If not > set empty_cell False
  3. Check if empty_cell is False > print the value of the cell
  4. After looping through columns in the row
    If count_empty is equal to the number of columns > means the whole row is empty > break and stop looping through rows

The Code:

# define empty_cell boolean
empty_cell= False
with xlrd.open_workbook(s) as wb:
    cs= wb.sheet_by_index(0)
    num_cols= cs.ncols
    num_rows= cs.nrows
    for row_index in range(1, num_rows):
        # set count empty cells
        count_empty = 0
        print('Row: {}'.format(row_index))
        for col_index in range(0,num_cols):
            # get cell value
            cell_val= cs.cell(row_index, col_index).value
            # check if cell is empty
            if cell_val== '': 
                # set empty cell is True
                empty_cell = True
                # increment counter
                count_empty+= 1
            else:
                # set empty cell is false
                empty_cell= False

            # check if cell is not empty
            if not empty_cell:
                # print value of cell
                print('Col #: {} | Value: {}'.format(col_index, cell_val))

        # check the counter if is = num_cols means the whole row is empty       
        if count_empty == num_cols:
            print ('Row is empty')
            # stop looping to next rows
            break     

Note: I used the first method cell_val= cs.cell(row_index, col_index).value to get the value of cells, I see it's simpler. If you want to use the other method change the following:

    cell_val= cs.cell(row_index, col_index) # remove .value
    cell_type= cs.cell_type(row_index, col_index) # add this line
    # check if cell is empty
    if cell_type == xlrd.XL_CELL_EMPTY: # change if cell_val== '':

Other links that helped me understand how to check if a cell is empty:
xlrd.XL_CELL_EMPTY and Validating a cell value using XLRD