How to extract text from MS Office documents in C#

Elias Haileselassie · Jun 18, 2009 · Viewed 67.5k times

I am trying to extract text (as a string) from MS Word (.doc, .docx), Excel, and PowerPoint files using C#. Where can I find a free and simple .NET library to read MS Office documents? I tried NPOI, but I couldn't find a sample showing how to use it.


KyleM · Dec 28, 2011

For Microsoft Word 2007 and Microsoft Word 2010 (.docx) files, you can use the Open XML SDK. The following snippet opens a document and returns its contents as text. It is especially useful for anyone who wants to parse the contents of a Word document with regular expressions. To use this solution, you need to reference DocumentFormat.OpenXml.dll, which is part of the Open XML SDK.


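    using System;
    using System.Text;
    using System.Xml;
    using DocumentFormat.OpenXml.Packaging;
    using Microsoft.SharePoint; // SPFile comes from the SharePoint object model

    public static string TextFromWord(SPFile file)
    {
        // WordprocessingML main namespace, bound to the "w" prefix below.
        const string wordmlNamespace = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

        StringBuilder textBuilder = new StringBuilder();
        using (WordprocessingDocument wdDoc = WordprocessingDocument.Open(file.OpenBinaryStream(), false))
        {
            // Manage namespaces to perform XPath queries.
            NameTable nt = new NameTable();
            XmlNamespaceManager nsManager = new XmlNamespaceManager(nt);
            nsManager.AddNamespace("w", wordmlNamespace);

            // Get the document part from the package.
            // Load the XML in the document part into an XmlDocument instance.
            XmlDocument xdoc = new XmlDocument(nt);
            xdoc.Load(wdDoc.MainDocumentPart.GetStream());

            XmlNodeList paragraphNodes = xdoc.SelectNodes("//w:p", nsManager);
            foreach (XmlNode paragraphNode in paragraphNodes)
            {
                // Concatenate every text run in the paragraph, then end the line.
                XmlNodeList textNodes = paragraphNode.SelectNodes(".//w:t", nsManager);
                foreach (XmlNode textNode in textNodes)
                {
                    textBuilder.Append(textNode.InnerText);
                }
                textBuilder.Append(Environment.NewLine);
            }
        }

        return textBuilder.ToString();
    }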
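
If the document does not live in SharePoint, the same approach should work with any readable stream. The following is a minimal sketch, not part of the original answer. It assumes the project references DocumentFormat.OpenXml, and the names `DocxText` and `TextFromDocx` are hypothetical, chosen only for illustration. Instead of XPath, it walks the SDK's typed `Paragraph` and `Text` elements, which correspond to the `w:p` and `w:t` nodes queried above.

```csharp
using System;
using System.IO;
using System.Text;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static class DocxText
{
    // Hypothetical helper: returns the text of the main document part,
    // one line per paragraph. The caller owns the stream.
    public static string TextFromDocx(Stream docxStream)
    {
        StringBuilder textBuilder = new StringBuilder();

        // Open read-only (isEditable = false).
        using (WordprocessingDocument wdDoc = WordprocessingDocument.Open(docxStream, false))
        {
            Body body = wdDoc.MainDocumentPart.Document.Body;

            // Paragraph corresponds to w:p, Text to w:t.
            foreach (Paragraph paragraph in body.Descendants<Paragraph>())
            {
                foreach (Text text in paragraph.Descendants<Text>())
                {
                    textBuilder.Append(text.Text);
                }
                textBuilder.Append(Environment.NewLine);
            }
        }

        return textBuilder.ToString();
    }
}
```

A caller could pass `File.OpenRead(path)` as the stream and then run `Regex.Matches` from System.Text.RegularExpressions over the returned string. Like the XPath version, this reads only the main document part, so headers, footers, footnotes, and comments are not included.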