15

For example, the address is:

Address = http://lol1.domain.com:8888/some/page

I want to save the subdomain into a variable so i could do like so;

print SubAddr
>> lol1
1

12 Answers 12

32

Package tldextract makes this task very easy, and then you can use urlparse as suggested if you need any further information:

>>> import tldextract
>>> tldextract.extract("http://lol1.domain.com:8888/some/page"
ExtractResult(subdomain='lol1', domain='domain', suffix='com')
>>> tldextract.extract("http://sub.lol1.domain.com:8888/some/page"
ExtractResult(subdomain='sub.lol1', domain='domain', suffix='com')
>>> urlparse.urlparse("http://sub.lol1.domain.com:8888/some/page")
ParseResult(scheme='http', netloc='sub.lol1.domain.com:8888', path='/some/page', params='', query='', fragment='')

Note that tldextract properly handles sub-domains.

1
  • great answer, should be voted as the best one :) thanks Lluis
    – Tom St
    Commented Jul 16, 2020 at 12:32
19

urlparse.urlparse will split the URL into protocol, location, port, etc. You can then split the location by . to get the subdomain.

import urlparse
url = urlparse.urlparse(address)
subdomain = url.hostname.split('.')[0]
8
  • Works very good. I used it like so Node = urlparse.urlparse(address).hostname.split('.')[0]
    – Marko
    Commented Aug 3, 2011 at 12:49
  • 6
    What if it's an IP address? And what if it has a second level subdomain?
    – naktinis
    Commented Sep 25, 2013 at 16:48
  • 2
    Subdomains may contain multiple dots so api.test is also valid, just keep this in mind. If you want a good package for doing this check https://pypi.python.org/pypi/tldextract. Commented May 16, 2016 at 20:16
  • 6
    This is actually a pretty bad answer. It fails if there's no subdomain, returning the domain instead. It fails for IP addresses (ok, fine), and it fails for multiple subdomains, like web.host1.google.com.
    – mlissner
    Commented Nov 26, 2016 at 22:56
  • 4
    in python 3.x you need to import this via from urllib.parse import urlparse Commented Nov 7, 2020 at 18:41
5

Modified version of the fantastic answer here: How to extract top-level domain name (TLD) from URL

You will need the list of effective tlds from here

from __future__ import with_statement
from urlparse import urlparse

# load tlds, ignore comments and empty lines:
with open("effective_tld_names.dat.txt") as tldFile:
    tlds = [line.strip() for line in tldFile if line[0] not in "/\n"]

class DomainParts(object):
    def __init__(self, domain_parts, tld):
        self.domain = None
        self.subdomains = None
        self.tld = tld
        if domain_parts:
            self.domain = domain_parts[-1]
            if len(domain_parts) > 1:
                self.subdomains = domain_parts[:-1]

def get_domain_parts(url, tlds):
    urlElements = urlparse(url).hostname.split('.')
    # urlElements = ["abcde","co","uk"]
    for i in range(-len(urlElements),0):
        lastIElements = urlElements[i:]
        #    i=-3: ["abcde","co","uk"]
        #    i=-2: ["co","uk"]
        #    i=-1: ["uk"] etc

        candidate = ".".join(lastIElements) # abcde.co.uk, co.uk, uk
        wildcardCandidate = ".".join(["*"]+lastIElements[1:]) # *.co.uk, *.uk, *
        exceptionCandidate = "!"+candidate

        # match tlds: 
        if (exceptionCandidate in tlds):
            return ".".join(urlElements[i:]) 
        if (candidate in tlds or wildcardCandidate in tlds):
            return DomainParts(urlElements[:i], '.'.join(urlElements[i:]))
            # returns ["abcde"]

    raise ValueError("Domain not in global list of TLDs")

domain_parts = get_domain_parts("http://sub2.sub1.example.co.uk:80",tlds)
print "Domain:", domain_parts.domain
print "Subdomains:", domain_parts.subdomains or "None"
print "TLD:", domain_parts.tld

Gives you:

Domain: example
Subdomains: ['sub2', 'sub1']
TLD: co.uk
1
4

A very basic approach, without any sanity checking could look like:

address = 'http://lol1.domain.com:8888/some/page'

host = address.partition('://')[2]
sub_addr = host.partition('.')[0]

print sub_addr

This of course assumes that when you say 'subdomain' you mean the first part of a host name, so in the following case, 'www' would be the subdomain:

http://www.google.com/

Is that what you mean?

2

What you are looking for is in: http://docs.python.org/library/urlparse.html

for example: ".".join(urlparse('http://www.my.cwi.nl:80/%7Eguido/Python.html').netloc.split(".")[:-2])

Will do the job for you (will return "www.my")

2
  • 2
    This assumes that the main domain name has two parts - which will fall down in certain cases, e.g. .co.uk addresses. Besides the UK, Israel, Brasil and Japan all have formal second level domains, and there are probably others.
    – Thomas K
    Commented Aug 3, 2011 at 12:03
  • My answer deals with this problem using a list of valid TLDs.
    – Acorn
    Commented Aug 3, 2011 at 12:24
1

First of All import tldextract, as this splits the URL into its constituents like: subdomain. domain, and suffix.

import tldextract

Then declare a variable (say ext) that stores the results of the query. We also have to provide it with the URL in parenthesis with double quotes. As shown below:

ext = tldextract.extract("http://lol1.domain.com:8888/some/page")

If we simply try to run ext variable, the output will be:

ExtractResult(subdomain='lol1', domain='domain', suffix='com')

Then if you want to use only subdomain or domain or suffix, then use any of the below code, respectively.

ext.subdomain

The result will be:

'lol1'
ext.domain

The result will be:

'domain'
ext.suffix

The result will be:

'com'

Also, if you want to store the results only of subdomain in a variable, then use the code below:

Sub_Domain = ext.subdomain

Then Print Sub_Domain

Sub_Domain

The result will be:

'lol1'
1

Standardize all domains to start with www. unless they have a subdomain.

from urllib.parse import urlparse
    
def has_subdomain(url):
    if len(url.split('.')) > 2:
        return True
    else:
        return False 

domain = urlparse(url).netloc
        
if not has_subdomain(url):
        domain_name = 'www.' + domain
        url = urlparse(url).scheme + '://' + domain
0

For extracting the hostname, I'd use urlparse from urllib2:

>>> from urllib2 import urlparse
>>> a = "http://lol1.domain.com:8888/some/page"
>>> urlparse.urlparse(a).hostname
'lol1.domain.com'

As to how to extract the subdomain, you need to cover for the case that there FQDN could be longer. How you do this would depend on your purposes. I might suggest stripping off the two right most components.

E.g.

>>> urlparse.urlparse(a).hostname.rpartition('.')[0].rpartition('.')[0]
'lol1'
0

We can use https://github.com/john-kurkowski/tldextract for this problem...

It's easy.

>>> ext = tldextract.extract('http://forums.bbc.co.uk')
>>> (ext.subdomain, ext.domain, ext.suffix)
('forums', 'bbc', 'co.uk')
0

tldextract separate the TLD from the registered domain and subdomains of a URL.

Installation

pip install tldextract

For the current question:

import tldextract

address = 'http://lol1.domain.com:8888/some/page'
domain = tldextract.extract(address).domain
print("Extracted domain name : ", domain)

The output:

Extracted domain name :  domain

In addition, there is a number of examples which is extremely related with the usage of tldextract.extract side.

0

Using python 3 (I'm using 3.9 to be specific), you can do the following:

from urllib.parse import urlparse

address = 'http://lol1.domain.com:8888/some/page'

url = urlparse(address)

url.hostname.split('.')[0]
0
import re

def extract_domain(domain):
   domain = re.sub('http(s)?://|(\:|/)(.*)|','', domain)
   matches = re.findall("([a-z0-9][a-z0-9\-]{1,63}\.[a-z\.]{2,6})$", domain)
   if matches:
       return matches[0]
   else:
       return domain

def extract_subdomains(domain):
   subdomains = domain = re.sub('http(s)?://|(\:|/)(.*)|','', domain)
   domain = extract_domain(subdomains)
   subdomains = re.sub('\.?'+domain,'', subdomains)
   return subdomains

Example to fetch subdomains:

print(extract_subdomains('http://lol1.domain.com:8888/some/page'))
print(extract_subdomains('kota-tangerang.kpu.go.id'))

Outputs:

lol1
kota-tangerang

Example to fetch domain

print(extract_domain('http://lol1.domain.com:8888/some/page'))
print(extract_domain('kota-tangerang.kpu.go.id'))

Outputs:

domain.com
kpu.go.id

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Not the answer you're looking for? Browse other questions tagged or ask your own question.