Get domain out of any URL string


Posted: March 20, 2011 in 1, How To, Programming, Tutorials
Tags: Domain name, PHP, programming, Top-level domain, Uniform Resource Locator

It’s a common problem with no single right answer: extract the top domain (e.g. example.com) from a given string, which may or may not be a valid URL. I had need of such functionality recently and found answers around the web lacking. So if you ever “just wanted the domain name” out of a string, give this a shot…



<?php
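function get_top_domain($url, $remove_subdomains = 'all') {
  // parse_url() gives null (or false for a badly malformed URL) when it finds no host
  $host = parse_url($url, PHP_URL_HOST);
  // no host found: fall back to treating the whole string as the host
  if (!is_string($host) || $host === '') $host = $url;
  $host = strtolower($host);
  switch ($remove_subdomains) {
    case 'www':
      // strip a leading "www." only
      if (strpos($host, 'www.') === 0) {
        $host = substr($host, 4);
      }
      return $host;
    case 'all':
    default:
      if (substr_count($host, '.') > 1) {
        // the greedy .+ swallows the subdomains, leaving the last two labels;
        // the TLD must be 2-4 letters
        if (preg_match("/^.+\.([a-z0-9\.\-]+\.[a-z]{2,4})$/", $host, $matches)) {
          return $matches[1];
        }
        // not a valid domain
        return false;
      }
      return $host;
  }
}

// some examples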
var_dump(get_top_domain('http://www.validurl.example.com/directory', 'all')); // #1: "example.com"
var_dump(get_top_domain('http://www.validurl.example.com/directory', 'www')); // #2: "validurl.example.com"
var_dump(get_top_domain('domain-string.example.com', 'all'));                 // #3: "example.com"
var_dump(get_top_domain('domain-string.example.com/nowfails', 'all'));        // #4: false
var_dump(get_top_domain('finds the domain url.example.com', 'all'));          // #5: "example.com"
var_dump(get_top_domain('12.34.56.78', 'all'));                               // #6: false
?>


Most of the examples are simply proofs, but I want to draw attention to the string in example #4, 'domain-string.example.com/nowfails'. This is not a valid URL, so parse_url() finds no host in it, forcing the script to fall back on the entire original string. In turn, the path part of the string breaks the regex match, causing a complete failure (return false;).
Is there a way to account for this? Surely, but I’m not about to tap that massive keg of exceptions (e.g. just a slash, slash plus path, slash plus another domain in a human-readable string, etc.).
No regex for validating URLs or email addresses is ever perfect; the “strict” RFC requirements are too damn broad. So I did what I always do: chose “what works” over “what’s technically right.” This one requires 2-4 letters for the top-level domain (TLD), so it doesn’t allow for the .museum TLD, and it doesn’t check whether the provided TLD is actually valid. If you need to do further verification, that’s on you. Here’s the current full list of valid TLDs provided by the IANA.
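For what it’s worth, the simplest of those cases (a slash plus a path on a string with no scheme) can be handled before the string ever reaches the regex. What follows is only a minimal sketch: strip_path_from_host() is a hypothetical helper name, not part of the function above, and it deliberately ignores the messier human-readable cases.

```php
<?php
// Without a scheme, parse_url() treats the whole string as a path,
// so asking for the host component yields null.
var_dump(parse_url('domain-string.example.com/nowfails', PHP_URL_HOST)); // NULL

// Hypothetical helper (an assumption, not from the original post):
// keep only the part of a scheme-less string before the first '/', '?' or '#'.
function strip_path_from_host($input) {
    // Strings that carry a scheme are left alone; parse_url() copes with those.
    if (strpos($input, '://') !== false) {
        return $input;
    }
    // strcspn() returns the length of the leading run free of these characters.
    return substr($input, 0, strcspn($input, '/?#'));
}

var_dump(strip_path_from_host('domain-string.example.com/nowfails'));
// "domain-string.example.com"
var_dump(strip_path_from_host('http://www.validurl.example.com/directory'));
// unchanged: "http://www.validurl.example.com/directory"
```

Passing the cleaned string to get_top_domain() instead of the raw one should let example #4 resolve the same way example #3 does, at the cost of one more assumption about what the input looks like.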
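If you do want that further verification, one possible approach is to compare the last label of the result against a list of known TLDs. The sketch below assumes a hypothetical helper, has_known_tld(), and a tiny hand-picked sample list; in real use the list would come from the IANA data rather than being hard-coded.

```php
<?php
// Hypothetical helper (an assumption, not from the original post):
// check the last dot-separated label of a domain against a list of TLDs.
function has_known_tld($domain, array $known_tlds) {
    $pos = strrpos($domain, '.');
    if ($pos === false) {
        return false; // no dot, so there is no TLD to check
    }
    $tld = strtolower(substr($domain, $pos + 1));
    return in_array($tld, $known_tlds, true);
}

// Illustrative sample only, NOT the IANA list.
$sample_tlds = array('com', 'net', 'org', 'museum');

var_dump(has_known_tld('example.com', $sample_tlds));    // bool(true)
var_dump(has_known_tld('example.museum', $sample_tlds)); // bool(true)
var_dump(has_known_tld('example.zzzz', $sample_tlds));   // bool(false)
```

Note that the regex itself would still reject .museum until the {2,4} quantifier is widened (to {2,6}, say); a list check like this only adds value on top of a regex loose enough to let the candidate through.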
If you need to modify the regex at all, I highly recommend you read this article about email address regex first for two reasons:

There’s a ton of overlap between email and URL regex matching
It will point out all the gotchas in your “better” regex theory that you didn’t think about

		