It’s a common problem with no single right answer: extract the top domain (e.g. example.com) from a given string, which may or may not be a valid URL. I had need of such functionality recently and found the answers around the web lacking. So if you ever “just wanted the domain name” out of a string, give this a shot…
<?php
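function get_top_domain($url, $remove_subdomains = 'all') {
    // parse_url() returns null when it finds no host and false when the
    // string is seriously malformed; either way, fall back to the raw string.
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        $host = $url;
    }
    $host = strtolower($host);

    switch ($remove_subdomains) {
        case 'www':
            // strip only a leading "www."
            if (strpos($host, 'www.') === 0) {
                $host = substr($host, 4);
            }
            return $host;

        case 'all':
        default:
            // at most one dot: there is no subdomain to strip
            if (substr_count($host, '.') <= 1) {
                return $host;
            }
            // capture the trailing "name.tld" pair; the TLD must be 2-4 letters
            if (preg_match('/^.+\.([a-z0-9.\-]+\.[a-z]{2,4})$/', $host, $matches)) {
                return $matches[1];
            }
            // not a valid domain
            return false;
    }
}

// some examples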
var_dump(get_top_domain('http://www.validurl.example.com/directory', 'all')); // #1: "example.com"
var_dump(get_top_domain('http://www.validurl.example.com/directory', 'www')); // #2: "validurl.example.com"
var_dump(get_top_domain('domain-string.example.com', 'all'));                 // #3: "example.com"
var_dump(get_top_domain('domain-string.example.com/nowfails', 'all'));        // #4: false
var_dump(get_top_domain('finds the domain url.example.com', 'all'));          // #5: "example.com"
var_dump(get_top_domain('12.34.56.78', 'all'));                               // #6: false
?>
Most of the examples are simply proofs, but I want to draw attention to the string in example #4, 'domain-string.example.com/nowfails'. This is not a valid URL: with no scheme in front, parse_url() treats the whole string as a path and returns no host, forcing the script to fall back to the entire original string. In turn, the path part of the string keeps the regex from matching, causing a complete failout (return false;).
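To see what is going on with example #4, it can help to ask parse_url() directly. This is only a quick check using the same string, once as-is and once with an http:// scheme added by hand; the variable names are mine.

```php
<?php
$raw        = 'domain-string.example.com/nowfails';
$withScheme = 'http://' . $raw;

// Without a scheme, the whole string is reported as the path and no host is found.
var_dump(parse_url($raw, PHP_URL_HOST));        // NULL
var_dump(parse_url($raw, PHP_URL_PATH));        // "domain-string.example.com/nowfails"

// With a scheme, the host and path are split as you would expect.
var_dump(parse_url($withScheme, PHP_URL_HOST)); // "domain-string.example.com"
var_dump(parse_url($withScheme, PHP_URL_PATH)); // "/nowfails"
```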
Is there a way to account for this? Surely, but I’m not about to tap that massive keg of exceptions (e.g. just a slash, slash plus path, slash plus another domain in a human-readable string, etc.).
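If you only care about the simplest of those cases (a scheme-less string followed by a slash or a path), one possible workaround is to add a scheme before handing the string to parse_url(). The helper below is a rough sketch under that assumption; guess_host() is a hypothetical name, not part of the function above, and it makes no attempt to deal with human-readable strings or anything else in the keg.

```php
<?php
// Hypothetical helper: returns a lowercased host, or false if none is found.
function guess_host($input) {
    $input = trim($input);

    // No "scheme://" prefix? Add one so parse_url() reads the leading part as a host.
    if (!preg_match('#^[a-z][a-z0-9+.\-]*://#i', $input)) {
        $input = 'http://' . $input;
    }

    $host = parse_url($input, PHP_URL_HOST);
    return (is_string($host) && $host !== '') ? strtolower($host) : false;
}

var_dump(guess_host('domain-string.example.com/nowfails'));        // "domain-string.example.com"
var_dump(guess_host('http://www.validurl.example.com/directory')); // "www.validurl.example.com"
```

The result could then be passed to get_top_domain() in place of the raw string, since a bare host name takes the same fallback path as example #3.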
No regex for validating URLs or email addresses is ever perfect; the “strict” RFC requirements are too damn broad. So I did what I always do: chose “what works” over “what’s technically right.” This one requires 2-4 letters for the top-level domain (TLD), so it doesn’t allow for the .museum TLD, and doesn’t check whether the provided TLD is actually valid. If you need to do further verification, that’s on you. Here’s the current full list of valid TLDs provided by the IANA.
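If you do want that further verification, a minimal sketch might compare the last label of the host against a list of TLDs you supply yourself. Everything here is an assumption for illustration: has_known_tld() is a hypothetical helper, the list is expected as an array of uppercase strings, and $sampleTlds is a tiny made-up sample, not the real IANA list.

```php
<?php
// Hypothetical helper: true if the host's last label appears in $validTlds.
// Assumes $validTlds holds uppercase strings such as 'COM'.
function has_known_tld($host, array $validTlds) {
    $pos = strrpos($host, '.');
    if ($pos === false) {
        return false; // no dot, so no TLD to check
    }
    $tld = strtoupper(substr($host, $pos + 1));
    return in_array($tld, $validTlds, true);
}

// Illustrative sample only; load the full list from wherever you keep it.
$sampleTlds = array('COM', 'NET', 'ORG', 'MUSEUM');

var_dump(has_known_tld('example.com', $sampleTlds));    // bool(true)
var_dump(has_known_tld('example.museum', $sampleTlds)); // bool(true)
var_dump(has_known_tld('12.34.56.78', $sampleTlds));    // bool(false)
```

Note that this checks the TLD independently of the regex, so a host ending in .museum passes here even though the 2-4 letter rule in get_top_domain() would still reject it.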
If you need to modify the regex at all, I highly recommend you read this article about email address regex first for two reasons:
- There’s a ton of overlap between email and URL regex matching
- It will point out all the gotchas in your “better” regex theory that you didn’t think about