Getting parts of a URL (Regex)

Question 1

regex language-agnostic url

pek · Aug 26, 2008 · Viewed 247.6k times · Source

Answer

A single regex to parse and break up a full URL, including query parameters and anchors, e.g.:

https://www.google.com/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash

^((http[s]?|ftp):\/)?\/?([^:\/\s]+)((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(.*?)(#[\w\-]+)?$

RegExp positions:

url: RegExp['$&'],
protocol: RegExp.$2,
host: RegExp.$3,
path: RegExp.$4,
file: RegExp.$6,
query: RegExp.$7,
hash: RegExp.$8
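As a rough illustration, the group positions above can be read from the array returned by `String.prototype.match` instead of the legacy `RegExp.$n` statics. This is only a sketch: the helper name `parseUrl` is hypothetical, and the query group is written as a lazy `(.*?)` so that the fragment is left for group 8 rather than being swallowed by the query.

```javascript
// Hypothetical helper: applies the single-regex approach and names the groups.
// The query group is a lazy (.*?) so that the fragment is left for group 8.
const URL_PATTERN =
  /^((http[s]?|ftp):\/)?\/?([^:\/\s]+)((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(.*?)(#[\w\-]+)?$/;

function parseUrl(url) {
  const m = url.match(URL_PATTERN);
  if (m === null) {
    return null; // the pattern did not match this input
  }
  return {
    url: m[0],      // whole match, like RegExp['$&']
    protocol: m[2], // scheme without the colon
    host: m[3],
    path: m[4],     // directory part, with leading and trailing slash
    file: m[6],
    query: m[7],    // includes the leading '?'
    hash: m[8],     // includes the leading '#', or undefined
  };
}

const parts = parseUrl(
  "https://www.google.com/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash"
);
console.log(parts);
```

Groups 1 and 5 are skipped on purpose: they are only wrappers for the optional protocol prefix and the repeated path segment.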

You could then further parse the host ('.' delimited) quite easily.
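A naive sketch of that host split might look like the following. It assumes the domain is always the last two labels, which does not hold for multi-label suffixes such as `co.uk`; handling those properly would need something like the Public Suffix List. The helper name `splitHost` is hypothetical.

```javascript
// Hypothetical helper: splits a host name on '.' into subdomain and domain.
// Assumption: the domain is the last two labels, which is wrong for
// multi-label suffixes such as "co.uk".
function splitHost(host) {
  const labels = host.split(".");
  return {
    subdomain: labels.slice(0, -2).join("."), // "" when there is no subdomain
    domain: labels.slice(-2).join("."),
  };
}

console.log(splitHost("www.google.com"));
```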

What I would do is use something like this:

/*
    ^(.*:)//([A-Za-z0-9\-\.]+)(:[0-9]+)?(.*)$
*/
proto $1
host $2
port $3
the-rest $4

Then further parse 'the rest' to be as specific as possible. Doing it all in one regex is, well, a bit crazy.
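One way this two-stage idea could be sketched in JavaScript is shown here. The first pattern is the coarse one above; the second pattern (`REST`) and the split at the last slash are assumptions added for illustration, as are the names `parseInStages`, `dir`, and `file`.

```javascript
// Stage 1: the coarse pattern from above (proto, host, port, the rest).
const COARSE = /^(.*:)\/\/([A-Za-z0-9\-\.]+)(:[0-9]+)?(.*)$/;
// Stage 2 (an assumption, not part of the answer): path, query, fragment.
const REST = /^([^?#]*)(\?[^#]*)?(#.*)?$/;

function parseInStages(url) {
  const coarse = url.match(COARSE);
  if (coarse === null) {
    return null; // not of the form proto://host...
  }
  const [, proto, host, port, rest] = coarse;
  // REST accepts any single-line string, so this match cannot be null here.
  const [, path, query, hash] = rest.match(REST);
  // Split the path at its last slash into directory and file name.
  const lastSlash = path.lastIndexOf("/");
  return {
    proto, // includes the trailing ':'
    host,
    port,  // includes the leading ':', or undefined
    path,
    dir: path.slice(0, lastSlash + 1),
    file: path.slice(lastSlash + 1),
    query, // includes the leading '?', or undefined
    hash,  // includes the leading '#', or undefined
  };
}

console.log(parseInStages("http://test.example.com/dir/subdir/file.html"));
```

Keeping the stages separate means each pattern stays short enough to check by eye, and a failure in the first stage is easy to tell apart from an unusual path or query.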

Question 2

Given the URL (single line):
http://test.example.com/dir/subdir/file.html

How can I extract the following parts using regular expressions:

The Subdomain (test)
The Domain (example.com)
The path without the file (/dir/subdir/)
The file (file.html)
The path with the file (/dir/subdir/file.html)
The URL without the path (http://test.example.com)
(add any other that you think would be useful)

The regex should work correctly even if I enter the following URL:

http://example.example.com/example/example/example.html
