I am using Python 3.4 and xlrd. I want to sort the Excel sheet by its primary column before processing it. Is there a library that can do this?
There are a couple of ways to do this. The first option is to use xlrd, as you have this tagged. The biggest downside is that xlrd only reads workbooks; writing the sorted result requires xlwt, which produces the older XLS format rather than XLSX.
These examples use an Excel document with this format:
Using xlrd and xlwt, with a few modifications from this answer:
import xlwt
from xlrd import open_workbook

target_column = 0  # This example only has 1 column, and it is 0-indexed

# Note: xlrd 2.0 and later read only .xls files; opening .xlsx needs xlrd < 2.0
book = open_workbook('test.xlsx')
sheet = book.sheets()[0]
# range, not xrange: xrange does not exist in Python 3
data = [sheet.row_values(i) for i in range(sheet.nrows)]
labels = data[0]  # Don't sort our headers
data = data[1:]  # Data begins on the second row
data.sort(key=lambda x: x[target_column])

bk = xlwt.Workbook()
sheet = bk.add_sheet(sheet.name)

for idx, label in enumerate(labels):
    sheet.write(0, idx, label)

for idx_r, row in enumerate(data):
    for idx_c, value in enumerate(row):
        sheet.write(idx_r + 1, idx_c, value)

bk.save('result.xls')  # Notice this is xls, not xlsx like the original file is
This outputs the following workbook:
Another option (and one that can write XLSX output) is to use pandas. The code is also shorter:
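import pandas as pd

xl = pd.ExcelFile("test.xlsx")
df = xl.parse("Sheet1")
# DataFrame.sort was removed in pandas 0.20; sort_values is its replacement
df = df.sort_values(by="Header Row")

# The with block saves the file on exit (writer.save() was removed in pandas 2.0)
with pd.ExcelWriter('output.xlsx') as writer:
    df.to_excel(writer, sheet_name='Sheet1', columns=["Header Row"], index=False)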
This outputs:
In the to_excel call, index is set to False so that the pandas DataFrame index isn't included in the Excel document. The rest of the keywords should be self-explanatory.
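One caveat for the xlrd version under Python 3: row_values returns numeric cells as floats and empty cells as empty strings, and Python 3 refuses to compare a str with a float, so data.sort can raise a TypeError when the sort column has mixed content. Below is a minimal sketch of one way around that, using only the standard library. The helper name sort_rows and the ordering it picks (numbers first, then text, blanks last) are my own assumptions for illustration, not part of xlrd.

```python
def sort_rows(rows, target_column=0, has_header=True):
    """Return rows sorted by one column, keeping the header row first.

    Hypothetical helper: `rows` is a list of lists such as the one built
    from sheet.row_values(i) in the xlrd example.
    """
    if has_header:
        header, body = rows[:1], rows[1:]
    else:
        header, body = [], rows

    def key(row):
        value = row[target_column]
        # Each key is a tuple whose first element groups values by kind,
        # so a number is never compared directly against a string.
        if value == '':
            return (2, 0, '')          # blank cells go last
        if isinstance(value, (int, float)):
            return (0, value, '')      # numbers go first, in numeric order
        return (1, 0, str(value))      # text goes next, in string order

    return header + sorted(body, key=key)


rows = [['Header Row'], [3.0], ['b'], [''], [1.0], ['a']]
print(sort_rows(rows))
# [['Header Row'], [1.0], [3.0], ['a'], ['b'], ['']]
```

In the xlrd example, the sorted body could then be written out with the same xlwt loop; if every cell in the sort column has the same type, the plain data.sort call is enough.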