Get an HTML page as XML code

Guy Daher · Feb 9, 2012 · Viewed 13.7k times

I just learned how to parse data in Xcode using NSXMLParser.

To do that I obviously need XML files, but I am still a beginner at web programming.

I am having difficulty getting an XML file from a web page. I tried converting HTML to XML with several tools, but I still don't get the format I want.

The format that I want should be similar to this:

<?xml version="1.0" encoding="UTF-8"?>
<Books>
    <Book id="1">
        <title>Circumference</title>
        <author>Nicholas Nicastro</author>
        <summary>Eratosthenes and the Ancient Quest to Measure the Globe.</summary>
    </Book>
    <Book id="2">
        <title>Copernicus Secret</title>
        <author>Jack Repcheck</author>
        <summary>How the scientific revolution began</summary>
    </Book>
</Books>

So how can I get a format like this from a webpage?

One more thing: for anyone who knows NSXMLParser in Xcode, is this the right way to extract data from websites? I mean getting an XML file, adding it to the project's resources, and then extracting the data from it?

Answer

Paaske · Mar 28, 2012

HTML is closely related to XML, and a page whose markup is well-formed (as in XHTML) can be parsed as XML. So if you want to extract data from a given website, get the HTML (the source of the page), parse it "as is", and then look for the data you need.

A simple website may look like this:

<html>
  <head>
    <title>My website</title>
  </head>
  <body>
    <h1>welcome</h1>
    Text
    <p>paragraph</p>
  </body>
</html>

As you can see, this is valid, well-formed XML. If you are interested in the <title>, parse this XML and look for the <title> tag.
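
To make the idea concrete, here is a minimal sketch of parsing that page as XML and pulling out the <title>. It is written in Python with the standard library's xml.etree.ElementTree rather than with NSXMLParser, purely for illustration; NSXMLParser is event-driven, so the code would look different there, but the idea is the same. The names PAGE and page_title are made up for this example.

```python
import xml.etree.ElementTree as ET

# The sample page from above, held in a string (PAGE is an illustrative name).
PAGE = """<html>
  <head>
    <title>My website</title>
  </head>
  <body>
    <h1>welcome</h1>
    Text
    <p>paragraph</p>
  </body>
</html>"""


def page_title(markup):
    """Parse markup as XML and return the text of the first <title> in <head>."""
    root = ET.fromstring(markup)        # raises ParseError if not well-formed
    title = root.find("./head/title")   # path relative to the <html> root
    return title.text if title is not None else None


print(page_title(PAGE))  # prints: My website
```

This only works because the sample page happens to be well-formed; most real pages are not.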

The problem is that browsers are not strict about the well-formedness of HTML. A missing end tag for <p> is often tolerated. An XML parser is normally not that "nice" and will report an error instead.
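
A rough sketch of that difference, again in Python as an illustration only: the same markup with missing </p> tags is rejected by a strict XML parser but accepted by the lenient html.parser module from the standard library, which stands in here for browser-like tolerance. BROKEN, parses_as_xml, TextCollector and lenient_text are hypothetical names for this example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Two <p> elements without end tags: fine for a browser, not well-formed XML.
BROKEN = "<html><body><p>first<p>second</body></html>"


def parses_as_xml(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TextCollector(HTMLParser):
    """Lenient HTML parser that just collects non-blank text chunks."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def lenient_text(markup):
    """Return the text chunks found by the tolerant HTML parser."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return collector.chunks


print(parses_as_xml(BROKEN))  # prints: False
print(lenient_text(BROKEN))   # prints: ['first', 'second']
```

So for arbitrary pages, a tolerant HTML parser is usually a safer starting point than a strict XML parser.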

Very often websites have RSS/Atom feeds. These are pure XML and should always be well-formed. These feeds exist so that the data can be easily interpreted by XML parsers.
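
As a small illustration of why feeds are easier to work with, the sketch below reads the title and link of each item from an RSS 2.0 document. It is in Python for brevity, and the feed content, the example.com URLs and the names FEED and feed_items are all invented for the example; a real app would download the feed from the site instead of embedding it.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 feed; a real one would be fetched from the website.
FEED = """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First post</title>
      <link>http://example.com/first</link>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/second</link>
    </item>
  </channel>
</rss>"""


def feed_items(feed_xml):
    """Return a list of (title, link) tuples, one per <item> in the feed."""
    # Parse from bytes so the encoding declaration in the prolog is honoured.
    root = ET.fromstring(feed_xml.encode("utf-8"))
    return [
        (item.findtext("title"), item.findtext("link"))
        for item in root.iter("item")
    ]


for title, link in feed_items(FEED):
    print(title, link)
```

The structure is predictable (channel, item, title, link), which is much closer to the <Books> format in the question than scraped HTML would be.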