Get subdomain from URL using Python

Question

For example, the address is:

Address = http://lol1.domain.com:8888/some/page

I want to save the subdomain into a variable so I can do something like this:

print SubAddr
>> lol1

This question should be useful: stackoverflow.com/questions/1066933/…
– Acorn
Commented Aug 3, 2011 at 11:47

wjandrea · Accepted Answer · 2022-09-17 16:15:09Z

32

Package tldextract makes this task very easy, and then you can use urlparse as suggested if you need any further information:

>>> import tldextract
>>> from urllib.parse import urlparse  # Python 2: from urlparse import urlparse
>>> tldextract.extract("http://lol1.domain.com:8888/some/page")
ExtractResult(subdomain='lol1', domain='domain', suffix='com')
>>> tldextract.extract("http://sub.lol1.domain.com:8888/some/page")
ExtractResult(subdomain='sub.lol1', domain='domain', suffix='com')
>>> urlparse("http://sub.lol1.domain.com:8888/some/page")
ParseResult(scheme='http', netloc='sub.lol1.domain.com:8888', path='/some/page', params='', query='', fragment='')

Note that tldextract properly handles sub-domains.

edited Sep 17, 2022 at 16:15

wjandrea

answered May 1, 2015 at 13:05

Lluís Vilanova

great answer, should be voted as the best one :) thanks Lluis
– Tom St
Commented Jul 16, 2020 at 12:32

radtek · Accepted Answer · 2022-03-29 17:12:58Z

19

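urlparse will split the URL into scheme, network location, port, etc. You can then split the hostname on . to get the subdomain.

from urllib.parse import urlparse  # Python 2: from urlparse import urlparse

address = "http://lol1.domain.com:8888/some/page"
url = urlparse(address)
# First label of the hostname; only correct for a single-level subdomain.
subdomain = url.hostname.split('.')[0]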

edited Mar 29, 2022 at 17:12

radtek

answered Aug 3, 2011 at 11:47

Daniel Roseman

Works very well. I used it like so: Node = urlparse.urlparse(address).hostname.split('.')[0]
– Marko
Commented Aug 3, 2011 at 12:49
6

What if it's an IP address? And what if it has a second level subdomain?
– naktinis
Commented Sep 25, 2013 at 16:48
2

Subdomains may contain multiple dots so api.test is also valid, just keep this in mind. If you want a good package for doing this check https://pypi.python.org/pypi/tldextract.
– sidneydobber
Commented May 16, 2016 at 20:16
6

This is actually a pretty bad answer. It fails if there's no subdomain, returning the domain instead. It fails for IP addresses (ok, fine), and it fails for multiple subdomains, like web.host1.google.com.
– mlissner
Commented Nov 26, 2016 at 22:56
4

in python 3.x you need to import this via from urllib.parse import urlparse
– Lord Elrond
Commented Nov 7, 2020 at 18:41
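
The edge cases raised in the comments above (IP addresses, a missing subdomain, multi-level subdomains) can be handled with the standard library alone if one accepts a simplifying assumption. The sketch below assumes the registered domain is always the last two labels, which is wrong for suffixes such as co.uk; naive_subdomain is a hypothetical helper name, not part of urllib.

```python
import ipaddress
from urllib.parse import urlparse


def naive_subdomain(url):
    """Return the subdomain of url, or '' if there is none.

    ASSUMPTION: the registered domain is exactly the last two labels
    (e.g. domain.com). This is wrong for suffixes like co.uk.
    """
    hostname = urlparse(url).hostname
    if hostname is None:
        # No scheme or no host: nothing to extract.
        return ""
    try:
        # IP addresses have no subdomain at all.
        ipaddress.ip_address(hostname)
        return ""
    except ValueError:
        pass
    labels = hostname.split(".")
    # Everything left of the last two labels is the subdomain.
    return ".".join(labels[:-2])


print(naive_subdomain("http://lol1.domain.com:8888/some/page"))  # lol1
print(naive_subdomain("http://web.host1.google.com/"))  # web.host1
print(repr(naive_subdomain("http://domain.com/")))  # ''
print(repr(naive_subdomain("http://127.0.0.1:8000/")))  # ''
```

For hosts under multi-label suffixes, a public-suffix-aware approach such as tldextract is still needed.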

Community · Accepted Answer · 2017-05-23 12:25:31Z

Modified version of the fantastic answer here: How to extract top-level domain name (TLD) from URL

You will need the list of effective tlds from here

from urllib.parse import urlparse  # Python 2: from urlparse import urlparse

# Load TLD rules, ignoring comments and empty lines.
with open("effective_tld_names.dat.txt", encoding="utf-8") as tld_file:
    tlds = set(line.strip() for line in tld_file if line[0] not in "/\n")

class DomainParts(object):
    def __init__(self, domain_parts, tld):
        self.domain = None
        self.subdomains = None
        self.tld = tld
        if domain_parts:
            self.domain = domain_parts[-1]
            if len(domain_parts) > 1:
                self.subdomains = domain_parts[:-1]

def get_domain_parts(url, tlds):
    urlElements = urlparse(url).hostname.split('.')
    # urlElements = ["abcde","co","uk"]
    for i in range(-len(urlElements), 0):
        lastIElements = urlElements[i:]
        #    i=-3: ["abcde","co","uk"]
        #    i=-2: ["co","uk"]
        #    i=-1: ["uk"] etc

        candidate = ".".join(lastIElements)  # abcde.co.uk, co.uk, uk
        wildcardCandidate = ".".join(["*"] + lastIElements[1:])  # *.co.uk, *.uk, *
        exceptionCandidate = "!" + candidate

        # Match against the TLD rules.
        if exceptionCandidate in tlds:
            # An exception rule marks the candidate itself as a registrable
            # domain, so the TLD starts one element further to the right.
            cut = len(urlElements) + i + 1
            return DomainParts(urlElements[:cut], ".".join(urlElements[cut:]))
        if candidate in tlds or wildcardCandidate in tlds:
            # For abcde.co.uk the domain parts are ["abcde"].
            return DomainParts(urlElements[:i], ".".join(urlElements[i:]))

    raise ValueError("Domain not in global list of TLDs")

domain_parts = get_domain_parts("http://sub2.sub1.example.co.uk:80", tlds)
print("Domain:", domain_parts.domain)
print("Subdomains:", domain_parts.subdomains or "None")
print("TLD:", domain_parts.tld)

Gives you:

Domain: example
Subdomains: ['sub2', 'sub1']
TLD: co.uk

Updated link to "list of effective tlds": wiki.mozilla.org/Public_Suffix_List#TLD_Lists, publicsuffix.org — Rivers, Commented Mar 3, 2021 at 14:46
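
To see how the plain, wildcard, and exception rules behave without downloading the full list, here is a small self-contained sketch. SAMPLE_RULES is a hand-picked sample for illustration only and split_host is a hypothetical helper; real use needs the complete list from publicsuffix.org.

```python
# Illustrative sample only; the real list has thousands of rules.
SAMPLE_RULES = {"com", "uk", "co.uk", "*.ck", "!www.ck"}


def split_host(hostname, rules=SAMPLE_RULES):
    """Split hostname into (subdomains, domain, suffix) using suffix rules."""
    labels = hostname.lower().split(".")
    # Try the longest candidate suffix first, then shorter ones.
    for start in range(len(labels)):
        candidate = ".".join(labels[start:])
        wildcard = ".".join(["*"] + labels[start + 1:])
        if "!" + candidate in rules:
            # Exception rule: the candidate itself is a registrable domain,
            # so the suffix begins one label further to the right.
            suffix_at = start + 1
        elif candidate in rules or wildcard in rules:
            suffix_at = start
        else:
            continue
        if suffix_at == 0:
            raise ValueError("hostname is itself a public suffix")
        return (
            labels[:suffix_at - 1],
            labels[suffix_at - 1],
            ".".join(labels[suffix_at:]),
        )
    raise ValueError("no matching suffix rule")


print(split_host("sub2.sub1.example.co.uk"))  # (['sub2', 'sub1'], 'example', 'co.uk')
print(split_host("www.ck"))  # ([], 'www', 'ck')
print(split_host("shop.example.ck"))  # ([], 'shop', 'example.ck')
```

The exception rule is what makes www.ck a registrable domain even though the wildcard *.ck would otherwise treat it as a suffix.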

Steve Mayne · Accepted Answer · 2011-08-03 11:44:39Z

4

A very basic approach, without any sanity checking, could look like:

address = 'http://lol1.domain.com:8888/some/page'

host = address.partition('://')[2]
sub_addr = host.partition('.')[0]

print(sub_addr)

This of course assumes that when you say 'subdomain' you mean the first part of a host name, so in the following case, 'www' would be the subdomain:

http://www.google.com/

Is that what you mean?
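
The partition approach above depends on the address having a scheme and no credentials in front of the host. As one possible way to add the missing sanity checks, the sketch below leans on urlparse and prepends // when no scheme is present so the host is still recognised. first_label is a hypothetical helper name, and it keeps this answer's definition of "subdomain" as the first part of the host name.

```python
from urllib.parse import urlparse


def first_label(address):
    """Return the first dot-separated label of the host in address."""
    if "://" not in address:
        # Without a scheme, urlparse treats the host as a path;
        # a leading '//' marks what follows as the network location.
        address = "//" + address
    hostname = urlparse(address).hostname
    if not hostname:
        raise ValueError("no host found in %r" % address)
    return hostname.partition(".")[0]


print(first_label("http://lol1.domain.com:8888/some/page"))  # lol1
print(first_label("lol1.domain.com:8888/some/page"))  # lol1
print(first_label("http://user:secret@www.google.com/"))  # www
```

With plain string partitioning, the last address would yield 'user:secret@www' instead of 'www'.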

answered Aug 3, 2011 at 11:44

Steve Mayne

Benjamin K. · Accepted Answer · 2011-08-03 11:48:05Z

2

What you are looking for is in: http://docs.python.org/library/urlparse.html

For example:

".".join(urlparse('http://www.my.cwi.nl:80/%7Eguido/Python.html').netloc.split(".")[:-2])

will do the job for you (it returns "www.my").

answered Aug 3, 2011 at 11:48

Benjamin K.

2

This assumes that the main domain name has two parts, which will fall down in certain cases, e.g. .co.uk addresses. Besides the UK, Israel, Brazil and Japan all have formal second-level domains, and there are probably others.
– Thomas K
Commented Aug 3, 2011 at 12:03
My answer deals with this problem using a list of valid TLDs.
– Acorn
Commented Aug 3, 2011 at 12:24

user14335364 · Accepted Answer · 2020-12-26 06:23:49Z

First, import tldextract, which splits a URL into its constituents: subdomain, domain, and suffix.

import tldextract

Then declare a variable (say ext) to store the result of the extraction. Pass the URL as a quoted string inside the parentheses, as shown below:

ext = tldextract.extract("http://lol1.domain.com:8888/some/page")

If we simply evaluate ext in the interpreter, the output will be:

ExtractResult(subdomain='lol1', domain='domain', suffix='com')

If you want only the subdomain, domain, or suffix, use the corresponding attribute below.

ext.subdomain

The result will be:

'lol1'

ext.domain

The result will be:

'domain'

ext.suffix

The result will be:

'com'

Also, if you want to store only the subdomain in a variable, use the code below:

Sub_Domain = ext.subdomain

Then evaluate Sub_Domain:

Sub_Domain

The result will be:

'lol1'

Andres R · Accepted Answer · 2022-05-20 15:10:38Z

1

Standardize all domains to start with www. unless they have a subdomain.

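from urllib.parse import urlparse

def has_subdomain(hostname):
    # Naive check: more than two dot-separated labels
    # (this miscounts hosts under suffixes such as co.uk).
    return len(hostname.split('.')) > 2

url = 'http://domain.com/some/page'  # example input
parsed = urlparse(url)
domain = parsed.netloc

if not has_subdomain(parsed.hostname):
    domain = 'www.' + domain
    url = parsed.scheme + '://' + domain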

answered May 20, 2022 at 15:10

Andres R

MattH · Accepted Answer · 2011-08-03 11:46:05Z

For extracting the hostname, I'd use urlparse from urllib.parse (the urlparse module in Python 2):

>>> from urllib.parse import urlparse
>>> a = "http://lol1.domain.com:8888/some/page"
>>> urlparse(a).hostname
'lol1.domain.com'

As to how to extract the subdomain, you need to allow for the case where the FQDN is longer. How you do this depends on your purposes. I might suggest stripping off the two rightmost components.

E.g.

>>> urlparse(a).hostname.rpartition('.')[0].rpartition('.')[0]
'lol1'

Prachit Patil · Accepted Answer · 2018-10-02 17:52:51Z

0

We can use https://github.com/john-kurkowski/tldextract for this problem...

It's easy.

>>> import tldextract
>>> ext = tldextract.extract('http://forums.bbc.co.uk')
>>> (ext.subdomain, ext.domain, ext.suffix)
('forums', 'bbc', 'co.uk')

answered Oct 2, 2018 at 17:52

Prachit Patil

ozturkib · Accepted Answer · 2020-11-02 14:25:41Z

0

tldextract separates the TLD from the registered domain and subdomains of a URL.

Installation

pip install tldextract

For the current question:

import tldextract

address = 'http://lol1.domain.com:8888/some/page'
domain = tldextract.extract(address).domain
print("Extracted domain name : ", domain)

The output:

Extracted domain name :  domain

In addition, there are a number of examples that are closely related to the usage of tldextract.extract.

answered Nov 2, 2020 at 14:25

ozturkib

s3bw · Accepted Answer · 2021-11-23 13:51:08Z

0

Using Python 3 (I'm using 3.9 to be specific), you can do the following:

from urllib.parse import urlparse

address = 'http://lol1.domain.com:8888/some/page'

url = urlparse(address)

url.hostname.split('.')[0]

answered Nov 23, 2021 at 13:51

s3bw

Pausi · Accepted Answer · 2022-02-18 16:35:18Z

import re

def extract_domain(domain):
    # Strip the scheme, then everything from the first ':' or '/' onwards.
    domain = re.sub(r'https?://|[:/].*', '', domain)
    matches = re.findall(r"([a-z0-9][a-z0-9\-]{1,63}\.[a-z\.]{2,6})$", domain)
    if matches:
        return matches[0]
    else:
        return domain

def extract_subdomains(domain):
    subdomains = re.sub(r'https?://|[:/].*', '', domain)
    domain = extract_domain(subdomains)
    # Escape the domain so its dots match literally, and anchor at the end.
    subdomains = re.sub(r'\.?' + re.escape(domain) + r'$', '', subdomains)
    return subdomains

Example to fetch subdomains:

print(extract_subdomains('http://lol1.domain.com:8888/some/page'))
print(extract_subdomains('kota-tangerang.kpu.go.id'))

Outputs:

lol1
kota-tangerang

Example to fetch domain

print(extract_domain('http://lol1.domain.com:8888/some/page'))
print(extract_domain('kota-tangerang.kpu.go.id'))

Outputs:

domain.com
kpu.go.id
