How to bypass bot detection and scrape a website using Python

Andy_ye · Apr 24, 2020 · Viewed 8.6k times

The problem

I am new to web scraping and was trying to create a scraper that takes a playlist link and extracts the list of songs and their artists.

But the site kept rejecting my connection because it thought I was a bot, so I used fake_useragent's UserAgent to generate a fake user-agent string and try to bypass the filter.

It sort of worked. The problem is that when you visit the website in a browser you can see the contents of the playlist, but when you extract the HTML with requests, the playlist contents are just a big blank space.

Maybe I have to wait for the page to load? Or is there a stronger bot filter?

My code

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

ua = UserAgent()

# Shortened link to the Melon playlist
melon_site = "http://kko.to/IU8zwNmjM"

# Send a random browser User-Agent so the server does not reject us as a bot
headers = {'User-Agent': ua.random}
result = requests.get(melon_site, headers=headers)

print(result.status_code)
src = result.content
soup = BeautifulSoup(src, 'html.parser')
print(soup)

Link of website

playlist link

HTML I get when using requests

HTML with a blank space where the playlist was supposed to be

Answer

Sharyar Vohra · Apr 24, 2020

Points to remember while scraping


1) Use a good User-Agent. ua.random may return a user agent that is blocked by the server.
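Point 1 can be sketched by pinning a known-good desktop browser User-Agent string instead of relying on ua.random (the exact string below is just an example of a Chrome-on-Windows UA from around the time of this question):

```python
# A fixed, realistic desktop browser User-Agent string; unlike ua.random,
# this never falls back to an outdated or headless UA that servers may block.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36"
)

headers = {
    "User-Agent": BROWSER_UA,
    # Extra headers that real browsers send; some servers check these too.
    "Accept-Language": "en-US,en;q=0.9",
}
# Pass as: requests.get(url, headers=headers)
```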

2) If you are doing too much scraping, slow down your pace with time.sleep() so that your IP address does not flood the server; otherwise it will block you.
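A minimal sketch of point 2, assuming you are fetching several pages in a loop; the yield is a placeholder for the real requests.get call:

```python
import time

def fetch_politely(urls, delay_seconds=2.0):
    """Yield URLs one at a time, pausing between them so the server
    is not flooded with requests from a single IP address.
    Sketch only: replace the yield with your actual requests.get call."""
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # pause BETWEEN requests, not after the last
        yield url  # here you would do: requests.get(url, headers=headers)
```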

3) If the server still blocks you, try rotating IP addresses by routing requests through different proxies.
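Point 3 can be illustrated with the proxies parameter of requests. The proxy addresses below are hypothetical placeholders (203.0.113.x is a reserved documentation range); substitute proxies you actually control:

```python
import itertools

# Hypothetical proxy pool -- replace with real proxies you have access to.
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
]

proxy_cycle = itertools.cycle(PROXIES)

def next_proxy_config():
    """Return a requests-style proxies dict, rotating through the pool
    so consecutive requests come from different IP addresses."""
    proxy = next(proxy_cycle)
    return {"http": proxy, "https": proxy}

# Usage: requests.get(url, headers=headers, proxies=next_proxy_config())
```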