Parse HTML documents and get all input fields with ID and value

bumble_bee_tuna · Dec 17, 2013

I have several thousand HTML invoices (generated by ASP.NET, so the markup is messy) that I'm trying to parse and save into a database.

Basically, something like this:

 foreach(var htmlDoc in HtmlFolder)
 {
   foreach(var inputBox in htmlDoc)
   { 
      //Make Collection of ID and Values Insert to DB
   }
 }  

From the other questions I've read, the best tool for this type of problem seems to be HtmlAgilityPack. However, for the life of me, I can't get the documentation .chm file to work. Any ideas on how I could accomplish this, with or without the Agility Pack?

Thanks in advance

Answer

Arithmomaniac · Dec 17, 2013

A newer alternative to HtmlAgilityPack is CsQuery. A later question compares the performance of the two, but CsQuery's main draw is that it lets you query the document with CSS selectors:

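using System;
using System.Linq;
using CsQuery;

var doc = CQ.CreateDocumentFromFile(htmldoc); // load and parse the file at path htmldoc
var fields = doc["input"]; // select every input element with a CSS selector
// Build (id, value) pairs; Value is a property on the node, not a method
var pairs = fields.Select(node => Tuple.Create(node.Id, node.Value));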
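
For comparison, the loop from the question could also be sketched with HtmlAgilityPack itself. This is only a minimal sketch under a few assumptions: the class name `InvoiceInputReader`, the method names, and the `*.html` file pattern are hypothetical, and the database insert is left as a placeholder because the question does not describe the schema.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

// Hypothetical helper; the names here are illustrative, not part of either library.
public static class InvoiceInputReader
{
    // Returns (id, value) pairs for every <input> element in one HTML string.
    public static List<KeyValuePair<string, string>> ReadInputs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var pairs = new List<KeyValuePair<string, string>>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var inputs = doc.DocumentNode.SelectNodes("//input");
        if (inputs == null)
        {
            return pairs;
        }

        foreach (HtmlNode input in inputs)
        {
            // Missing attributes fall back to an empty string.
            string id = input.GetAttributeValue("id", "");
            string value = input.GetAttributeValue("value", "");
            pairs.Add(new KeyValuePair<string, string>(id, value));
        }

        return pairs;
    }

    // Walks a folder of invoices; the "*.html" pattern is an assumption.
    public static void ProcessFolder(string folder)
    {
        foreach (string path in Directory.EnumerateFiles(folder, "*.html"))
        {
            foreach (var pair in ReadInputs(File.ReadAllText(path)))
            {
                // Placeholder: replace with the actual database insert.
                Console.WriteLine("{0}: {1} = {2}",
                    Path.GetFileName(path), pair.Key, pair.Value);
            }
        }
    }
}
```

Calling `InvoiceInputReader.ProcessFolder(folderPath)` with the invoice directory would then print one line per input field. Keeping the parsing in `ReadInputs` separate from the file handling makes it easy to check against a small HTML string before running it over thousands of files.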