Split html row into string array

fearofawhackplanet · Aug 4, 2010 · Viewed 13.6k times

I have data in an html file, in a table:

<table>
    <tr><td>001</td><td>MC Hammer</td><td>Can't Touch This</td></tr>
    <tr><td>002</td><td>Tone Loc</td><td>Funky Cold Medina</td></tr>
    <tr><td>003</td><td>Funkdoobiest</td><td>Bow Wow Wow</td></tr>
</table>

How do I split a single row into an array or list?

string row = streamReader.ReadLine();

List<string> data = row.Split //... how do I do this bit?

string artist = data[1];

Answer

3Dave · Aug 4, 2010

Short answer: never try to parse HTML from the wild with regular expressions. It will most likely come back to haunt you.

Longer answer: As long as you can absolutely, positively guarantee that the HTML that you are parsing fits the given structure, you can use string.Split() as Jenni suggested.

string html = "<tr><td>001</td><td>MC Hammer</td><td>Can't Touch This</td></tr>";

string[] values = html.Split(new string[] { "<tr>","</tr>","<td>","</td>" }, StringSplitOptions.RemoveEmptyEntries);

List<string> list = new List<string>(values);

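Listing the tags independently keeps this slightly more readable, and StringSplitOptions.RemoveEmptyEntries keeps you from getting an empty string in your list between adjacent closing and opening tags. Note that it also drops genuinely empty cells (<td></td>), which shifts the indexes of the values after them.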
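To tie this back to the question's ReadLine() loop, here is a minimal sketch of the same Split call wrapped in a helper. RowSplitter and SplitRow are names made up for illustration, and it assumes every row looks exactly like the sample: lowercase tags, no attributes, no whitespace between cells. The one addition is Trim(), because the rows in the file are indented and the leading spaces would otherwise come back as the first list entry.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of any library.
public static class RowSplitter
{
    // Tags treated as delimiters. Assumes lowercase tags without attributes.
    private static readonly string[] Tags = { "<tr>", "</tr>", "<td>", "</td>" };

    public static List<string> SplitRow(string row)
    {
        // Trim first so the indentation in front of <tr> does not
        // survive as a whitespace-only entry at index 0.
        string[] values = row.Trim().Split(Tags, StringSplitOptions.RemoveEmptyEntries);
        return new List<string>(values);
    }
}
```

With this in place, the snippet from the question becomes List<string> data = RowSplitter.SplitRow(row); followed by string artist = data[1];. Lines that are not rows, such as <table> and </table>, come back as a single-element list, so skip them (for example by checking that the trimmed line starts with "<tr>") before indexing.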

If this HTML is coming from the wild, or from a tool that may change - in other words, if this is more than a one-off transaction - I strongly encourage you to use something like the HTML Agility Pack instead. It's pretty easy to integrate, and there are lots of examples on the Intarwebs.
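For comparison, a rough sketch of the HTML Agility Pack route. It assumes the HtmlAgilityPack NuGet package is referenced; TableReader and ReadRows are hypothetical names, and the whole file is passed in as one string rather than read line by line.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is installed

// Hypothetical helper, not part of the library.
public static class TableReader
{
    public static List<List<string>> ReadRows(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var rows = new List<List<string>>();

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection trNodes = doc.DocumentNode.SelectNodes("//tr");
        if (trNodes == null)
        {
            return rows;
        }

        foreach (HtmlNode tr in trNodes)
        {
            var cells = new List<string>();
            HtmlNodeCollection tdNodes = tr.SelectNodes("./td");
            if (tdNodes != null)
            {
                foreach (HtmlNode td in tdNodes)
                {
                    // Decode entities such as &amp; and trim stray whitespace.
                    cells.Add(HtmlEntity.DeEntitize(td.InnerText).Trim());
                }
            }
            rows.Add(cells);
        }
        return rows;
    }
}
```

Called as TableReader.ReadRows(File.ReadAllText(path)), this gives one list per row, so the artist of the first row is rows[0][1]. Unlike the Split version, it keeps empty cells in place and is not thrown off by attributes, tag casing, or whitespace between tags.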