19
\$\begingroup\$

To stay in practice with my Python I've decided to write an explicit song checker. It checks each word in the song against an array of 14 explicit words contained in the text file. I've decided not to include the words that are to be checked against, for I am not sure about the rule of having explicit words in a program's code. Feedback on efficiency and structure is what I'm going for mostly. Style and other critiques are invited as well.

explicit_song_checker.py

explicit_words = [
    #explicit words not shown 
]

def isClean(song_path):
    with open(song_path) as song:
        for line in song:
            words = line.split(" ")
            for word in words:
                for explicit_word in explicit_words:
                    if word == explicit_word:
                        return False

        return True

def main():
    song = raw_input("Enter path to song: \n")
    if isClean(song):
        print("CLEAN")
    else:
        print("EXPLICIT")

if __name__ == '__main__':
    main()
\$\endgroup\$
5
  • 6
    \$\begingroup\$ The only tiny thing (not big enough for an answer, IMHO) I want to add to all the great answers already given, is about naming variables. It should be clear from the name what a variable contains. What does another programmer expect that a variable named song contains? It cannot be a song, because that is something that you hear. So probably the lyrics then? Or the file? Ah, the path to the file that contains the lyrics of the song. I would try to make variable names as "explicit" as possible ;-) So song_path (as used in the function) is already much better :-) \$\endgroup\$
    – wovano
    Commented Jun 15, 2019 at 19:31
  • 9
    \$\begingroup\$ As a counterexample, check out the Scunthorpe problem for what can happen if your check is overzealous… \$\endgroup\$
    – gidds
    Commented Jun 16, 2019 at 0:32
  • \$\begingroup\$ Not sure if scope creeping the review, but would be nice if your script could take a path to a folder and print all songs inside it. \$\endgroup\$ Commented Jun 17, 2019 at 8:19
  • \$\begingroup\$ why not use a dictionary and some sort of hash. a linear search would be horrible \$\endgroup\$ Commented Jun 18, 2019 at 7:23
  • 1
    \$\begingroup\$ There is a meta CR post that involves this post \$\endgroup\$ Commented Jun 19, 2019 at 22:46

6 Answers 6

25
\$\begingroup\$

I recommend practicing Python 3 rather than Python 2 these days.

According to PEP 8, isClean() should be is_clean().

One or more of your loops could be replaced by some use of the any() function. Note that this suggests that an is_explicit() function would be a more natural concept than an is_clean() function.

I would expect song lyrics to contain a mixture of uppercase and lowercase. Words may be delimited by punctuation as well as spaces. Therefore, words = line.split(" ") is probably too naïve. Furthermore, I would expect each song to be short enough to fit entirely in memory very comfortably, so processing each file line by line is an unnecessary complication.

I would rewrite the program to use a regular expression instead.

import re

explicit_words = [
    …
]

is_explicit = re.compile(
    r'\b(?:' +
        '|'.join(re.escape(w) for w in explicit_words) +
    r')\b',
    re.IGNORECASE).search

def main():
    with open(raw_input("Enter path to song: ")) as song:
        print("EXPLICIT" if is_explicit(song.read()) else "CLEAN")

if __name__ == '__main__':
    main()

The way I chose to write the is_explicit() function above might show too much Haskell influence for some Python programmers' taste. Here's a more conventional way of writing it:

def is_explicit(text):
    return re.search(
        r'\b(?:' +
            '|'.join(re.escape(w) for w in explicit_words) +
        r')\b',
        text,
        re.IGNORECASE
    )
\$\endgroup\$
4
  • \$\begingroup\$ The regex answer has a variety of things I don't like. Firstly, this compiles the regex every time. 2nd, regex has no guarantees of O(1) like a set would. \$\endgroup\$ Commented Jun 17, 2019 at 23:50
  • \$\begingroup\$ @OscarSmith The program calls the helper function once from main(), and the word list contains just 14 entries, so neither objection was a significant concern in this solution. \$\endgroup\$ Commented Jun 18, 2019 at 0:00
  • \$\begingroup\$ Oh, I was mainly commenting on the bottom block, where it's in the function. \$\endgroup\$ Commented Jun 18, 2019 at 0:21
  • 2
    \$\begingroup\$ I don't think using the regular expression makes this code any easier to understand. xkcd.com/1171 \$\endgroup\$ Commented Jun 18, 2019 at 7:32
21
\$\begingroup\$

Splitting line into words using words = line.split(" ") will only split the line on space characters. If two words are separated by a tab character, or other white space character other than a space, the split will not happen at that point, which may confuse your explicit checker.

Using words = line.split() would split on spaces, tab characters, and any other white space characters.

You might also want to investigate splitting using the word break regular expression \b, which would treat non-word characters as break points, so that a line containing an explicit word immediately followed by a comma (such as "explicit,") or period, or semicolon, etc. won't sneak through the filter.


            for explicit_word in explicit_words:
                if word == explicit_word:
                    return False

Python has great searching built into it. Instead of explicitly writing a loop over all explicit words, you could just ask Python to tell you if a word is in the list of explicit words:

            if word in explicit_words:
                return False

Or better, use a set() of explicit words, and the in operation drops from \$O(n)\$ to \$O(1)\$ time complexity:

explicit_words = {
    "explicit", "bad", "ugly", "offensive", "words" 
}

# ...

            if word in explicit_words:
                return False

We've improved things a little, but we still have an explicit (pun intended) loop over the list of words in a line. We can use the any() method, and a list comprehension, to get Python to do the looping for us, which should be faster.

    for line in song:
        words = line.split()
        if any(word in explicit_words for word in words):
            return False

If word in explicits returns True for any of the word in the list words, the any() method will "short-circuit" and immediately return True without examining the remaining items in words.


If "explicit" is a bad word, and the song lyrics contains "Explicit" or "EXPLICIT", you wouldn't want to mark the song as "CLEAN", right? Perhaps you want word.lower() in explicit_words? As @wizzwizz4 pointed out, .casefold() is better than .lower().


Resulting method:

explicit_words = set(explicit_word.casefold()
    for explicit_word in ("explicit", "bad", "ugly", "offensive", "words"))

def isClean(song_path):
    with open(song_path) as song:
        for line in song:
            words = line.split()
            if any(word.casefold() in explicit_words for word in words):
                return False    
        return True
\$\endgroup\$
5
  • 1
    \$\begingroup\$ Why does it need to iterate over the song lines? song.split() would also split on \n (and \r, of course). Call me unpythonic, but I would directly return any(word.lower() in explicit_words for word in song.split()) \$\endgroup\$
    – 0x2b3bfa0
    Commented Jun 15, 2019 at 12:22
  • \$\begingroup\$ @Helio I expect you would need song.read().split(), not song.split(). As for why I didn’t? I considered making it one-statement, and decided against doing so because the split() doesn’t consider punctuation as word separators. That area still needed replacement with a regex \b style word splitting, so embedding it inside of the any(...) hides the work that still needed to be done. \$\endgroup\$
    – AJNeufeld
    Commented Jun 15, 2019 at 12:37
  • \$\begingroup\$ yes, my bad; I've posted the comment thoughtlessly. A naïve alternative to regular expressions would be replacing punctuation symbols by spaces and then running split. However, that would be too clumsy. If this code is not an exercise, maybe the OP should use some specialized framework like NLTK. \$\endgroup\$
    – 0x2b3bfa0
    Commented Jun 15, 2019 at 12:56
  • 3
    \$\begingroup\$ Why not str.casefold instead of str.lower? They're the same for (most of) Latin, but different for stuff like Greek. \$\endgroup\$
    – wizzwizz4
    Commented Jun 15, 2019 at 16:17
  • \$\begingroup\$ @wizzwizz4 Fair point; well made. All the words in explicit_words would need to be run through casefold() as well. \$\endgroup\$
    – AJNeufeld
    Commented Jun 15, 2019 at 16:26
16
\$\begingroup\$

Some suggestions:

  • Using argparse to parse arguments is preferable to any kind of interactive input, because it means that
    • it's much easier to automate and integrate in other scripts,
    • you could trivially support passing multiple songs in one command,
    • it's what any experienced shell user would expect, and
    • if the user adds --help or doesn't pass any arguments they will get a synopsis for the command.
  • The list of explicit words should not be part of the code, because it is configuration and is very likely to change often. You'll want to put words in a separate file and read them in during initialization. You could very easily make this an optional argument (with argparse) so that you could do fancy stuff like ./explicit_song_checker.py --words=./words.txt --words=./other.txt ./kill-the-queen.txt ./bulls-on-parade.txt
  • black will automatically format your code to be pretty idiomatic, such as using two empty lines between top-level functions.
  • flake8 will check for style issues. I believe it will assume that you're writing Python 3, and so will complain about the use of raw_input (called input in Python 3).
  • mypy can be used to enforce type hints, which is a super helpful way to make your code clearer without resorting to comments.
\$\endgroup\$
0
4
\$\begingroup\$
  • To simplify your input you could change word == explicit_word to explicit_word in word.

    This can lead to the Scunthorpe problem, and so from here you can check if the word only contains prefixes and suffixes.

    Given the amount of common prefixes alone, it's unlikely you'd catch all variations using a blacklist alone. It would also catch gibberish/semigibberish that you may not have thought of - "Antioverfucked" / "Overantifucked".

  • Your code misses words without spaces "I don't give afuck" and so you may want to perform a 'similarity check'.

    This could also match thinks like "sexx".

  • Your code doesn't reasonably handle ZWSP and other forms of Unicode abuse.

    >>> print('exp\u200blitive')
    exp​litive
    >>> 'exp\u200blitive' == 'explitive'
    False
    >>> print('exp\u1DD3litive')
    expᷓlitive
    
  • Your code doesn't reasonably detect leet speak.

  • Your code doesn't reasonably detect creative Unicode character selections.

    >>> print('\u2131\u01DA\u0255\u212A')
    ℱǚɕK
    
  • Your code doesn't reasonably handle phonetic abuse.

    "Phuck" is pronounced "Fuck".

    "Bend Dover" sounds like "bend over".

    "I CUP" spelt out is "I see you pee".

    "See you next Tuesday" each starting phonetic spells "CU...".


When I say doesn't reasonably, I mean if someone combined the above to get something ridiculous like "4n7ipᷓhǚ​ɕK" your word list would have to be ridiculous in size to match it.

It should also be fairly obvious that someone probably will find a way to bypass your check no matter what you do.

Practicality vs Perfection

Given how widespread leet speak is you should handle that at a bare minimum. This is fairly easy, you just 'translate' from this alphabet. And perform the checks you're already doing.

You should then probably use a 'similar check' to match "sexx" or "afuck". This is as duplicating a letter or missing a space is readable and easy.
This check may result in problems as "Scunthorpe" could be understood as "s cunt thorp e" so would be split into two two known words and two extra letters, and then the similarity check my come back low enough to say it's a swearword.

From here ZWSP and other unicode abuse probably are the next most common. And with things like deasciiify it's only getting easier.

\$\endgroup\$
2
  • 1
    \$\begingroup\$ Yeah, if catching explicit language was easy, you wouldn't have things like controversies over video game language filters getting over-aggressive and filtering out legitimate speech. \$\endgroup\$
    – nick012000
    Commented Jun 17, 2019 at 2:37
  • 1
    \$\begingroup\$ If you only check for prefixes and suffixes, you still have the Penistone and Cockfosters problems. \$\endgroup\$
    – Mike Scott
    Commented Jun 17, 2019 at 16:43
2
\$\begingroup\$
explicit_words = [
    #explicit words not shown 
]

Instead of embedding the word list within the program, you might want to treat it as configuration, and load it from a data file. The advantage of this is that it can allow the use of alternative configurations by command-line option (e.g. use a list of French obscenities for French-language lyrics).


On efficiency: we're only interested in the set of words in the lyrics. Given that many songs have repeated lyrics, then it's probably worth reducing the lyrics to a set of words before comparing that set with the banned words (use & for set intersection), or set.isdisjoint() for a potentially faster test.

\$\endgroup\$
1
\$\begingroup\$

The code feedback is already done wonderfully. I'd like to write about some pointers that might be important, apart from the leet speak:

  • The word Phoque , is French for seal. There are numerous examples of words which are perfectly meaningful in some language but if pronounced in English, might be considered offensive. You might need to do a phonetic similarity check
  • The context can be taken into account. For example, the word 'blow' or 'mom' need not be offensive. It becomes offensive only when it is surrounded by contextual words that make it offensive, ('job', etc). In these cases, you might need to come up with an algorithm that outputs a probability score and filter songs based on that.
\$\endgroup\$
1
  • \$\begingroup\$ I'd like to take this up as a project. This has tremendous scope. For example, you can integrate this into the smartphones of kids . If youre interested, drop me a message \$\endgroup\$ Commented Jun 18, 2019 at 6:14

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Not the answer you're looking for? Browse other questions tagged or ask your own question.