Parse HTML with SwiftSoup (Swift)?

Vincent · Feb 24, 2018 · Viewed 8.1k times

I'm trying to parse some websites with SwiftSoup; say one of them is a Medium article. How can I extract the body of the page and load it into another UIViewController, the way Instapaper does?


Here is the code I use to extract the title:

import SwiftSoup

class WebViewController: UIViewController, UIWebViewDelegate {

...

override func viewDidLoad() {
        super.viewDidLoad()

        let url = URL(string: "https://medium.com/@timjwise/stop-lying-to-yourself-when-you-snub-panhandlers-its-not-for-their-own-good-199d0aa7a513")
        let request = URLRequest(url: url!)
        webView.loadRequest(request)

        guard let myURL = url else {
            print("Error: \(String(describing: url)) doesn't seem to be a valid URL")
            return
        }
        let html = try! String(contentsOf: myURL, encoding: .utf8)

        do {
            let doc: Document = try SwiftSoup.parseBodyFragment(html)
            let headerTitle = try doc.title()
            print("Header title: \(headerTitle)")
        } catch Exception.Error(let type, let message) {
            print("Message: \(message)")
        } catch {
            print("error")
        }

}

}

But I've had no luck extracting the body of this page, or of any other website. Is there a way to get this to work? Does it require CSS or JavaScript (I know nothing about either)?

Answer

Scinfu · Mar 7, 2018

Use the body() function (see https://github.com/scinfu/SwiftSoup#parsing-a-body-fragment). Try this:

let html = try! String(contentsOf: myURL, encoding: .utf8)

do {
    let doc: Document = try SwiftSoup.parseBodyFragment(html)
    let headerTitle = try doc.title()

    // The <body> element of the parsed document
    let body = doc.body()
    // Elements to remove, in this case images
    let undesiredElements: Elements? = try body?.select("img[src]")
    // Remove them from the document
    try undesiredElements?.remove()

    print("Header title: \(headerTitle)")
} catch Exception.Error(_, let message) {
    print("Message: \(message)")
} catch {
    print("error")
}
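
To hand the result to another view controller, the body has to be turned back into a string after the unwanted elements are removed. The following is only a minimal sketch of one way to do that, not part of the original answer. The helper name extractReadableContent is hypothetical, it uses SwiftSoup.parse (for a full page) rather than parseBodyFragment, and it assumes the main text sits in an <article> element, falling back to <body> when there is none. That assumption may not hold for every site.

```swift
import Foundation
import SwiftSoup

/// Hypothetical helper: returns the page title and a cleaned-up HTML string
/// for the main content of `html`.
func extractReadableContent(from html: String) throws -> (title: String, contentHTML: String) {
    // Parse the whole page so both <head> (for the title) and <body> are available.
    let doc: Document = try SwiftSoup.parse(html)
    let title = try doc.title()

    // Assumption: the main text lives in an <article> element; otherwise use <body>.
    let articles: Elements = try doc.select("article")
    guard let content = articles.first() ?? doc.body() else {
        return (title, "")
    }

    // Strip elements a reader-style view usually does not want.
    _ = try content.select("script, style, noscript, img").remove()

    // Inner HTML of the remaining content, ready to be displayed elsewhere.
    let contentHTML = try content.html()
    return (title, contentHTML)
}

// Usage with a small inline page instead of a network request.
let sample = """
<html><head><title>Demo</title></head>
<body><article><p>Hello</p><img src="a.png"></article></body></html>
"""
let result = try extractReadableContent(from: sample)
print(result.title)
print(result.contentHTML)
```

The returned contentHTML string could then be stored in a property on the destination view controller and shown with a web view's loadHTMLString(_:baseURL:), or content.text() could be used instead of content.html() if plain text in a UITextView is enough.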