I have a site with 2000 pages and I want to iterate through each page to generate a sitemap, using the file_get_html() function and regular expressions.
Obviously this can't be completed in a single server-side execution, because it will hit PHP's maximum execution time. I suspect it needs to perform the work in smaller chunks, save the progress to a database, and then queue the next task. Any suggestions?
When you run it from the command line, there is no maximum execution time.
You can also call set_time_limit(0); for this, if your hosting provider allows it.
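A minimal sketch of what that could look like as a CLI script, assuming you run it yourself with `php crawl.php`; `crawlSite()` is just a placeholder for your own crawling logic:

```php
<?php
// crawl.php - run from the command line: php crawl.php
// In CLI mode max_execution_time is already 0, so this is mainly a
// safeguard when the same script is triggered over the web.
set_time_limit(0);

// Optional: give a long-running crawl more headroom (value is an assumption).
ini_set('memory_limit', '512M');

// Placeholder for the actual crawling / sitemap-generation routine.
function crawlSite(string $startUrl): void
{
    echo "Crawling {$startUrl}\n";
    // ... fetch pages, extract links, write the sitemap ...
}

crawlSite('https://example.com/');
```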
I can't say whether your IP address will get banned, since that depends on the security measures of the server you send your requests to.
Other solution
You can fetch one page (or a few) per run and scan its source code for new URLs. Queue any URLs you find in a database, and on the next run process the queue. A rough sketch of this approach is below.
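Here is one way that could look, as a sketch rather than a finished implementation. It assumes an SQLite database and a `url_queue` table (both names are illustrative), and it uses file_get_contents() with a simple href regex in place of file_get_html(); swap in your own parser and URL filter as needed:

```php
<?php
// Each run processes a small batch and then exits, so no single
// execution gets near the time limit. Re-run it via cron until the
// queue is empty.
$pdo = new PDO('sqlite:' . __DIR__ . '/crawl.sqlite');
$pdo->exec('CREATE TABLE IF NOT EXISTS url_queue (
    url  TEXT PRIMARY KEY,
    done INTEGER NOT NULL DEFAULT 0
)');

// Seed the queue on the first run (INSERT OR IGNORE keeps URLs unique).
$pdo->prepare('INSERT OR IGNORE INTO url_queue (url) VALUES (?)')
    ->execute(['https://example.com/']);

// Grab a small batch of unprocessed URLs (batch size is an assumption).
$batch = $pdo->query('SELECT url FROM url_queue WHERE done = 0 LIMIT 10')
             ->fetchAll(PDO::FETCH_COLUMN);

$insert   = $pdo->prepare('INSERT OR IGNORE INTO url_queue (url) VALUES (?)');
$markDone = $pdo->prepare('UPDATE url_queue SET done = 1 WHERE url = ?');

foreach ($batch as $url) {
    $html = @file_get_contents($url);
    if ($html !== false) {
        // Collect same-site links found in the page source.
        if (preg_match_all('#href="(https://example\.com/[^"]*)"#i', $html, $m)) {
            foreach ($m[1] as $found) {
                $insert->execute([$found]);
            }
        }
    }
    $markDone->execute([$url]);
}
```

Once `SELECT ... WHERE done = 0` returns no rows, every discovered URL has been visited, and the `url_queue` table holds the full list you need to write out the sitemap.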