Fastest method to remove Empty rows and Columns From Excel Files using Interop

Hadi picture Hadi · Nov 13, 2016 · Viewed 21.7k times · Source

I have a lot of excel files that contains data and it contains empty rows and empty columns. like shown bellow

Excel preview

I am trying to remove Empty rows and columns from excel using interop. I create a simple winform application and used the following code and it works fine.

Dim lstFiles As New List(Of String)
lstFiles.AddRange(IO.Directory.GetFiles(m_strFolderPath, "*.xls", IO.SearchOption.AllDirectories))

Dim m_XlApp = New Excel.Application
Dim m_xlWrkbs As Excel.Workbooks = m_XlApp.Workbooks
Dim m_xlWrkb As Excel.Workbook

For Each strFile As String In lstFiles
    m_xlWrkb = m_xlWrkbs.Open(strFile)
    Dim m_XlWrkSheet As Excel.Worksheet = m_xlWrkb.Worksheets(1)
    Dim intRow As Integer = 1

    While intRow <= m_XlWrkSheet.UsedRange.Rows.Count
        If m_XlApp.WorksheetFunction.CountA(m_XlWrkSheet.Cells(intRow, 1).EntireRow) = 0 Then
            m_XlWrkSheet.Cells(intRow, 1).EntireRow.Delete(Excel.XlDeleteShiftDirection.xlShiftUp)
        Else
            intRow += 1
        End If
    End While

    Dim intCol As Integer = 1
    While intCol <= m_XlWrkSheet.UsedRange.Columns.Count
        If m_XlApp.WorksheetFunction.CountA(m_XlWrkSheet.Cells(1, intCol).EntireColumn) = 0 Then
            m_XlWrkSheet.Cells(1, intCol).EntireColumn.Delete(Excel.XlDeleteShiftDirection.xlShiftToLeft)
        Else
            intCol += 1
        End If
    End While
Next

m_xlWrkb.Save()
m_xlWrkb.Close(SaveChanges:=True)

Marshal.ReleaseComObject(m_xlWrkb)
Marshal.ReleaseComObject(m_xlWrkbs)
m_XlApp.Quit()
Marshal.ReleaseComObject(m_XlApp)

But when cleaning big excel files it takes a lot of time. Any suggestions for optimizing this code? or another way to clean this excel files faster? Is there a function that can delete empty rows in one click?

I don't have problem if answers are using C#

EDIT:

I uploaded a sample file Sample File. But not all files have same structure.

Answer

JohnG picture JohnG · Nov 21, 2016

I found that looping through the excel worksheet can take some time if the worksheet is large. So my solution tried to avoid any looping in the worksheet. To avoid looping through the worksheet, I made a 2 dimensional object array from the cells returned from usedRange with:

Excel.Range targetCells = worksheet.UsedRange;
object[,] allValues = (object[,])targetCells.Cells.Value;

This is the array I loop through to get the indexes of the empty rows and columns. I make 2 int lists, one keeps the row indexes to delete the other keeps the column indexes to delete.

List<int> emptyRows = GetEmptyRows(allValues, totalRows, totalCols);
List<int> emptyCols = GetEmptyCols(allValues, totalRows, totalCols);

These lists will be sorted from high to low to simplify deleting rows from the bottom up and deleting columns from right to left. Then simply loop through each list and delete the appropriate row/col.

DeleteRows(emptyRows, worksheet);
DeleteCols(emptyCols, worksheet);

Finally after all the empty rows and columns have been deleted, I SaveAs the file to a new file name.

Hope this helps.

EDIT:

Addressed the UsedRange issue such that if there are empty rows at the top of the worksheet, those rows will now be removed. Also this will remove any empty columns to the left of the starting data. This allows for the indexing to work properly even if there are empty rows or columns before the data starts. This was accomplished by taking the address of the first cell in UsedRange this will be an address of the form “$A$1:$D$4”. This will allow the use of an offset if the empty rows at the top and empty columns to the left are to remain and not be deleted. In this case I am simply deleting them. To get the number of rows to delete from the top can be calculated by the first “$A$4” address where the “4” is the row that the first data appears. So we need to delete the top 3 rows. The Column address is of the form “A”, “AB” or even “AAD” this required some translation and thanks to How to convert a column number (eg. 127) into an excel column (eg. AA) I was able to determine how many columns on the left need to be deleted.

class Program {
  static void Main(string[] args) {
    Excel.Application excel = new Excel.Application();
    string originalPath = @"H:\ExcelTestFolder\Book1_Test.xls";
    Excel.Workbook workbook = excel.Workbooks.Open(originalPath);
    Excel.Worksheet worksheet = workbook.Worksheets["Sheet1"];
    Excel.Range usedRange = worksheet.UsedRange;

    RemoveEmptyTopRowsAndLeftCols(worksheet, usedRange);

    DeleteEmptyRowsCols(worksheet);

    string newPath = @"H:\ExcelTestFolder\Book1_Test_Removed.xls";
    workbook.SaveAs(newPath, Excel.XlSaveAsAccessMode.xlNoChange);

    workbook.Close();
    excel.Quit();
    System.Runtime.InteropServices.Marshal.ReleaseComObject(workbook);
    System.Runtime.InteropServices.Marshal.ReleaseComObject(excel);
    Console.WriteLine("Finished removing empty rows and columns - Press any key to exit");
    Console.ReadKey();
  }

  private static void DeleteEmptyRowsCols(Excel.Worksheet worksheet) {
    Excel.Range targetCells = worksheet.UsedRange;
    object[,] allValues = (object[,])targetCells.Cells.Value;
    int totalRows = targetCells.Rows.Count;
    int totalCols = targetCells.Columns.Count;

    List<int> emptyRows = GetEmptyRows(allValues, totalRows, totalCols);
    List<int> emptyCols = GetEmptyCols(allValues, totalRows, totalCols);

    // now we have a list of the empty rows and columns we need to delete
    DeleteRows(emptyRows, worksheet);
    DeleteCols(emptyCols, worksheet);
  }

  private static void DeleteRows(List<int> rowsToDelete, Excel.Worksheet worksheet) {
    // the rows are sorted high to low - so index's wont shift
    foreach (int rowIndex in rowsToDelete) {
      worksheet.Rows[rowIndex].Delete();
    }
  }

  private static void DeleteCols(List<int> colsToDelete, Excel.Worksheet worksheet) {
    // the cols are sorted high to low - so index's wont shift
    foreach (int colIndex in colsToDelete) {
      worksheet.Columns[colIndex].Delete();
    }
  }

  private static List<int> GetEmptyRows(object[,] allValues, int totalRows, int totalCols) {
    List<int> emptyRows = new List<int>();

    for (int i = 1; i < totalRows; i++) {
      if (IsRowEmpty(allValues, i, totalCols)) {
        emptyRows.Add(i);
      }
    }
    // sort the list from high to low
    return emptyRows.OrderByDescending(x => x).ToList();
  }

  private static List<int> GetEmptyCols(object[,] allValues, int totalRows, int totalCols) {
    List<int> emptyCols = new List<int>();

    for (int i = 1; i < totalCols; i++) {
      if (IsColumnEmpty(allValues, i, totalRows)) {
        emptyCols.Add(i);
      }
    }
    // sort the list from high to low
    return emptyCols.OrderByDescending(x => x).ToList();
  }

  private static bool IsColumnEmpty(object[,] allValues, int colIndex, int totalRows) {
    for (int i = 1; i < totalRows; i++) {
      if (allValues[i, colIndex] != null) {
        return false;
      }
    }
    return true;
  }

  private static bool IsRowEmpty(object[,] allValues, int rowIndex, int totalCols) {
    for (int i = 1; i < totalCols; i++) {
      if (allValues[rowIndex, i] != null) {
        return false;
      }
    }
    return true;
  }

  private static void RemoveEmptyTopRowsAndLeftCols(Excel.Worksheet worksheet, Excel.Range usedRange) {
    string addressString = usedRange.Address.ToString();
    int rowsToDelete = GetNumberOfTopRowsToDelete(addressString);
    DeleteTopEmptyRows(worksheet, rowsToDelete);
    int colsToDelete = GetNumberOfLeftColsToDelte(addressString);
    DeleteLeftEmptyColumns(worksheet, colsToDelete);
  }

  private static void DeleteTopEmptyRows(Excel.Worksheet worksheet, int startRow) {
    for (int i = 0; i < startRow - 1; i++) {
      worksheet.Rows[1].Delete();
    }
  }

  private static void DeleteLeftEmptyColumns(Excel.Worksheet worksheet, int colCount) {
    for (int i = 0; i < colCount - 1; i++) {
      worksheet.Columns[1].Delete();
    }
  }

  private static int GetNumberOfTopRowsToDelete(string address) {
    string[] splitArray = address.Split(':');
    string firstIndex = splitArray[0];
    splitArray = firstIndex.Split('$');
    string value = splitArray[2];
    int returnValue = -1;
    if ((int.TryParse(value, out returnValue)) && (returnValue >= 0))
      return returnValue;
    return returnValue;
  }

  private static int GetNumberOfLeftColsToDelte(string address) {
    string[] splitArray = address.Split(':');
    string firstindex = splitArray[0];
    splitArray = firstindex.Split('$');
    string value = splitArray[1];
    return ParseColHeaderToIndex(value);
  }

  private static int ParseColHeaderToIndex(string colAdress) {
    int[] digits = new int[colAdress.Length];
    for (int i = 0; i < colAdress.Length; ++i) {
      digits[i] = Convert.ToInt32(colAdress[i]) - 64;
    }
    int mul = 1; int res = 0;
    for (int pos = digits.Length - 1; pos >= 0; --pos) {
      res += digits[pos] * mul;
      mul *= 26;
    }
    return res;
  }
}

EDIT 2: For testing I made a method that loops thru the the worksheet and compared it to my code that loops thru an object array. It shows a significant difference.

enter image description here

Method to Loop thru the worksheet and delete empty rows and columns.

enum RowOrCol { Row, Column };
private static void ConventionalRemoveEmptyRowsCols(Excel.Worksheet worksheet) {
  Excel.Range usedRange = worksheet.UsedRange;
  int totalRows = usedRange.Rows.Count;
  int totalCols = usedRange.Columns.Count;

  RemoveEmpty(usedRange, RowOrCol.Row);
  RemoveEmpty(usedRange, RowOrCol.Column);
}

private static void RemoveEmpty(Excel.Range usedRange, RowOrCol rowOrCol) {
  int count;
  Excel.Range curRange;
  if (rowOrCol == RowOrCol.Column)
    count = usedRange.Columns.Count;
  else
    count = usedRange.Rows.Count;

  for (int i = count; i > 0; i--) {
    bool isEmpty = true;
    if (rowOrCol == RowOrCol.Column)
      curRange = usedRange.Columns[i];
    else
      curRange = usedRange.Rows[i];

    foreach (Excel.Range cell in curRange.Cells) {
      if (cell.Value != null) {
        isEmpty = false;
        break; // we can exit this loop since the range is not empty
      }
      else {
        // Cell value is null contiue checking
      }
    } // end loop thru each cell in this range (row or column)

    if (isEmpty) {
      curRange.Delete();
    }
  }
}

Then a Main for testing/timing the two methods.

enum RowOrCol { Row, Column };

static void Main(string[] args)
{
  Excel.Application excel = new Excel.Application();
  string originalPath = @"H:\ExcelTestFolder\Book1_Test.xls";
  Excel.Workbook workbook = excel.Workbooks.Open(originalPath);
  Excel.Worksheet worksheet = workbook.Worksheets["Sheet1"];
  Excel.Range usedRange = worksheet.UsedRange;

  // Start test for looping thru each excel worksheet
  Stopwatch sw = new Stopwatch();
  Console.WriteLine("Start stopwatch to loop thru WORKSHEET...");
  sw.Start();
  ConventionalRemoveEmptyRowsCols(worksheet);
  sw.Stop();
  Console.WriteLine("It took a total of: " + sw.Elapsed.Milliseconds + " Miliseconds to remove empty rows and columns...");

  string newPath = @"H:\ExcelTestFolder\Book1_Test_RemovedLoopThruWorksheet.xls";
  workbook.SaveAs(newPath, Excel.XlSaveAsAccessMode.xlNoChange);
  workbook.Close();
  Console.WriteLine("");

  // Start test for looping thru object array
  workbook = excel.Workbooks.Open(originalPath);
  worksheet = workbook.Worksheets["Sheet1"];
  usedRange = worksheet.UsedRange;
  Console.WriteLine("Start stopwatch to loop thru object array...");
  sw = new Stopwatch();
  sw.Start();
  DeleteEmptyRowsCols(worksheet);
  sw.Stop();

  // display results from second test
  Console.WriteLine("It took a total of: " + sw.Elapsed.Milliseconds + " Miliseconds to remove empty rows and columns...");
  string newPath2 = @"H:\ExcelTestFolder\Book1_Test_RemovedLoopThruArray.xls";
  workbook.SaveAs(newPath2, Excel.XlSaveAsAccessMode.xlNoChange);
  workbook.Close();
  excel.Quit();
  System.Runtime.InteropServices.Marshal.ReleaseComObject(workbook);
  System.Runtime.InteropServices.Marshal.ReleaseComObject(excel);
  Console.WriteLine("");
  Console.WriteLine("Finished testing methods - Press any key to exit");
  Console.ReadKey();
}

EDIT 3 As per OP request... I updated and changed the code to match the OP code. With this I found some interesting results. See below.

I changed the code to match the functions you are using ie… EntireRow and CountA. The code below I found that it preforms terribly. Running some tests I found the code below was in the 800+ milliseconds execution time. However one subtle change made a huge difference.

On the line:

while (rowIndex <= worksheet.UsedRange.Rows.Count)

This is slowing things down a lot. If you create a range variable for UsedRang and not keep regrabbibg it with each iteration of the while loop will make a huge difference. So… when I change the while loop to…

Excel.Range usedRange = worksheet.UsedRange;
int rowIndex = 1;

while (rowIndex <= usedRange.Rows.Count)
and
while (colIndex <= usedRange.Columns.Count)

This performed very close to my object array solution. I did not post the results, as you can use the code below and change the while loop to grab the UsedRange with each iteration or use the variable usedRange to test this.

private static void RemoveEmptyRowsCols3(Excel.Worksheet worksheet) {
  //Excel.Range usedRange = worksheet.UsedRange;     // <- using this variable makes the while loop much faster 
  int rowIndex = 1;

  // delete empty rows
  //while (rowIndex <= usedRange.Rows.Count)     // <- changing this one line makes a huge difference - not grabbibg the UsedRange with each iteration...
  while (rowIndex <= worksheet.UsedRange.Rows.Count) {
    if (excel.WorksheetFunction.CountA(worksheet.Cells[rowIndex, 1].EntireRow) == 0) {
      worksheet.Cells[rowIndex, 1].EntireRow.Delete(Excel.XlDeleteShiftDirection.xlShiftUp);
    }
    else {
      rowIndex++;
    }
  }

  // delete empty columns
  int colIndex = 1;
  // while (colIndex <= usedRange.Columns.Count) // <- change here also

  while (colIndex <= worksheet.UsedRange.Columns.Count) {
    if (excel.WorksheetFunction.CountA(worksheet.Cells[1, colIndex].EntireColumn) == 0) {
      worksheet.Cells[1, colIndex].EntireColumn.Delete(Excel.XlDeleteShiftDirection.xlShiftToLeft);
    }
    else {
      colIndex++;
    }
  }
}

UPDATE by @Hadi

You can alter DeleteCols and DeleteRows function to get better performance if excel contains extra blank rows and columns after the last used ones:

private static void DeleteRows(List<int> rowsToDelete, Microsoft.Office.Interop.Excel.Worksheet worksheet)
{
    // the rows are sorted high to low - so index's wont shift

    List<int> NonEmptyRows = Enumerable.Range(1, rowsToDelete.Max()).ToList().Except(rowsToDelete).ToList();

    if (NonEmptyRows.Max() < rowsToDelete.Max())
    {

        // there are empty rows after the last non empty row

        Microsoft.Office.Interop.Excel.Range cell1 = worksheet.Cells[NonEmptyRows.Max() + 1,1];
        Microsoft.Office.Interop.Excel.Range cell2 = worksheet.Cells[rowsToDelete.Max(), 1];

        //Delete all empty rows after the last used row
        worksheet.Range[cell1, cell2].EntireRow.Delete(Microsoft.Office.Interop.Excel.XlDeleteShiftDirection.xlShiftUp);


    }    //else last non empty row = worksheet.Rows.Count



    foreach (int rowIndex in rowsToDelete.Where(x => x < NonEmptyRows.Max()))
    {
        worksheet.Rows[rowIndex].Delete();
    }
}

private static void DeleteCols(List<int> colsToDelete, Microsoft.Office.Interop.Excel.Worksheet worksheet)
{
    // the cols are sorted high to low - so index's wont shift

    //Get non Empty Cols
    List<int> NonEmptyCols = Enumerable.Range(1, colsToDelete.Max()).ToList().Except(colsToDelete).ToList();

    if (NonEmptyCols.Max() < colsToDelete.Max())
    {

        // there are empty rows after the last non empty row

        Microsoft.Office.Interop.Excel.Range cell1 = worksheet.Cells[1,NonEmptyCols.Max() + 1];
        Microsoft.Office.Interop.Excel.Range cell2 = worksheet.Cells[1,NonEmptyCols.Max()];

        //Delete all empty rows after the last used row
        worksheet.Range[cell1, cell2].EntireColumn.Delete(Microsoft.Office.Interop.Excel.XlDeleteShiftDirection.xlShiftToLeft);


    }            //else last non empty column = worksheet.Columns.Count

    foreach (int colIndex in colsToDelete.Where(x => x < NonEmptyCols.Max()))
    {
        worksheet.Columns[colIndex].Delete();
    }
}

check my answer at Get Last non empty column and row index from excel using Interop