CodeSOD: Brillant Python Programmers

Sandra from InitAg (previously) tries to keep the team's code quality up. The team she's on uses CI, code reviews, linting and type checking, and, most important of all, hires qualified people. Overall, the team's been successful. Recently, anyway.

The company got its start doing data science, which meant much of the initial code was written by brilliant PhDs who didn't know the first thing about writing software. Most of that code has been retired, but it's proven impossible to root out all of it.

Which brings us to Stan. Stan was a one-man dev team/sysadmin for a mission-critical piece of software. No one else worked on it, and no one else looked at it, but that was "okay" because Stan was happy to work evenings and weekends without anyone even suggesting it. Stan loved his work, perhaps a little too much. And as brilliant as Stan was, when it came to software engineering he was, at best, "brillant".

Which brings us to a file called utils/file_io.py. As you might gather from the full name there, this is a "utility" module stuffed with "file and IO" related functions. Let's look at a few:

def del_rw(action, name, exc):
    """
    See https://stackoverflow.com/questions/21261132/shutil-rmtree-to-remove-readonly-files
    Is used by shutil.rmtree to remove read only files as well
    """
    os.chmod(name, stat.S_IWRITE)
    os.remove(name)

This, I think, is a case of "maybe this shouldn't be a function?" At two lines, it's clearer to just write those lines at the call site. Also, it takes three parameters and uses only one of them; to be fair, the three-argument shape is what shutil.rmtree's onerror callback demands, but the function still ignores two of them.
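
In fairness to the Stack Overflow answer it cites, a handler like this usually retries whatever call rmtree just failed on, instead of hard-coding os.remove. A minimal sketch (the directory name is hypothetical, and note that onerror is deprecated in favor of onexc as of Python 3.12):

import os
import shutil
import stat

def force_remove(func, path, excinfo):
    # shutil.rmtree hands onerror (function, path, excinfo);
    # clear the read-only bit, then retry the call that failed.
    os.chmod(path, stat.S_IWRITE)
    func(path)

shutil.rmtree("build_output", onerror=force_remove)  # hypothetical path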

def safe_rmtree(remove_dir, minimum_parent_dir, min_parent_parts=2):
    """
    Recursively removes all files in a directory that should contain the min parent dir
    :param remove_dir:
    :param minimum_parent_dir:
    :param min_parent_parts: nr of folders that the parent should contain to avoid removing eg C:
    :return:
    """
    if not os.path.exists(remove_dir):
        print(f"WARNING in safe_rmtree: {remove_dir} does not exist, nothing is removed")
        return

    remove_path = pathlib.Path(remove_dir)
    minimum_parent_path = pathlib.Path(minimum_parent_dir)

    if not minimum_parent_path.is_dir():
        raise AssertionError("Min parent dir not a valid dir {}".format(minimum_parent_dir))

    if len(minimum_parent_path.parts) <= min_parent_parts:
        raise AssertionError(
            "Unsafe removal of {}: minimum parent dir is too short {}".format(
                remove_dir, minimum_parent_dir
            )
        )

    if minimum_parent_path in remove_path.parents:
        print("REMOVE {}".format(remove_dir))
        shutil.rmtree(remove_dir, onerror=del_rw)  # use del_rw to also remove read only files
    else:
        raise AssertionError(
            "Unsafe removal of {}: does not contain minimum parent dir {}".format(
                remove_dir, minimum_parent_dir
            )
        )

    time.sleep(0.1)

I definitely don't like this. We do, at least, use all of our parameters. While the documentation doesn't really make it clear what they all do, this will remove a directory if and only if that directory is contained under minimum_parent_dir, and minimum_parent_dir itself has more than min_parent_parts path components (that's the check that's supposed to stop you from deleting, say, all of C:).

I understand, broadly, why you'd want some checks around a potentially destructive operation, but I have to wonder why this is the particular set of checks we got.
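
For comparison, here's a minimal sketch of the same guard built on pathlib alone, assuming Python 3.9+ for Path.is_relative_to. Resolving both paths first also defuses symlinks and "../" segments, which the original's purely lexical parents check does not:

import pathlib
import shutil

def guarded_rmtree(remove_dir, boundary_dir):
    # Resolve symlinks and ".." segments before comparing anything.
    target = pathlib.Path(remove_dir).resolve()
    boundary = pathlib.Path(boundary_dir).resolve()
    # Refuse to delete the boundary itself, or anything outside it.
    if target == boundary or not target.is_relative_to(boundary):
        raise ValueError(f"refusing to remove {target}: not inside {boundary}")
    shutil.rmtree(target)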

Also, why the tenth-of-a-second sleep at the end? shutil isn't doing things on background threads; it just manipulates the filesystem via syscalls.

Now, what's notable about safe_rmtree is that it uses the pathlib.Path API for interacting with file paths. This is the correct way to do things in Python, which is why this next function is just funny:

def file_name_from_path(path, extension=False):
    """
    Get the name of a file (without the extension) given a path
    :param path:
    :param extension: default false, but if true fn will include extension
    :return: a string with the file name (optional extension)
    """
    basename = ntpath.basename(path)

    if extension is True:
        return basename
    else:
        return os.path.splitext(basename)[0]

This reinvents methods that are already in pathlib.Path. I also like that the documentation contradicts itself: this returns the name of the file (without the extension) except when it also returns the extension.
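
For reference, the pathlib properties the function reimplements:

from pathlib import Path

p = Path("/data/reports/summary.csv")  # hypothetical path
print(p.name)  # "summary.csv" -- the file name, with extension
print(p.stem)  # "summary"     -- the file name, without extension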

def create_dir(new_dir, remove_data=False, verbose=False):
    """
    Create a new directory. NOTE: this function only creates one new folder, the root
    folder must already exist.
    :param new_dir: directory to create
    :param remove_data: if yes, remove existing data in folder.
    :param verbose:
    :return: a string containing the new directory
    """

    if remove_data and os.path.exists(new_dir):
        shutil.rmtree(new_dir)
        time.sleep(0.1)

    if os.path.exists(new_dir):
        if verbose:
            print("Output directory {} already exists".format(new_dir))
    else:
        os.makedirs(new_dir)

    return new_dir

This function creates a new directory, possibly deleting the old one first. Not with safe_rmtree, mind you, just a bare shutil.rmtree, but hey, that random sleep is back. The docstring is also wrong: os.makedirs happily creates the entire chain of missing parent folders, "root folder" and all.
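
Minus the recursive delete and the nap, the create half of this is a one-liner in the standard library. A sketch, assuming the caller does want missing parents created:

from pathlib import Path

# exist_ok=True makes this idempotent; parents=True creates any
# missing ancestors. No sleep required.
Path("output/run_01").mkdir(parents=True, exist_ok=True)  # hypothetical path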

def dirs_in_dir(data_dir, contains=""):
    """
    Get the paths of all directories in a directory containing a set of characters, the basename
    :param data_dir: directory for which to list all paths with a base name
    :param contains: characters that should be in the name of the directory
    :return: a list of all directory paths
    """
    if not os.path.exists(data_dir):
        print("Warning in get_dirs_in_dir: directory {} does not exist".format(data_dir))
        return

    dir_paths = [
        os.path.join(data_dir, d)
        for d in os.listdir(data_dir)
        if (os.path.isdir(os.path.join(data_dir, d)) and contains.lower() in d.lower())
    ]

    return dir_paths

This is another reinvention of functionality that already exists in pathlib.Path. I'd say, "And they already know about that!" but they clearly don't; they copied code off StackOverflow and didn't read it.
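
A sketch of the same listing in pathlib, keeping the original's case-insensitive substring filter. Unlike the original, a missing directory raises FileNotFoundError instead of silently returning None:

from pathlib import Path

def dirs_in_dir(data_dir, contains=""):
    # iterdir() yields Path objects; .name is the final path component.
    return [d for d in Path(data_dir).iterdir()
            if d.is_dir() and contains.lower() in d.name.lower()]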

How about writing to a log file in quite possibly the slowest way imaginable?

def write_to_log(text_string, file_path, level="INFO", code="I_NOTDEFINED", entry="PROJ"):
    print(f"Log: {text_string}", flush=True)
    if not os.path.exists(file_path):
        file = open(file_path, "w")
        file.close()

    lookup_dict = error_lookup_dict
    responsible = lookup_dict[code]["responsible"]

    with open(file_path, "a") as file:
        dt_string = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        log_text = "{}; {}; {}; {}; {}; {}".format(
            dt_string, entry, level, responsible, code, text_string + "\n"
        )
        file.write(log_text)

Most logging frameworks open the file once, when you start, and keep the handle open until you're done logging. This one not only checks whether the file exists on every call (creating it just to close it again), it reopens it for every single write. And while it's at it, it also prints everything it logs. Python has a built-in logging framework that handles all of these features.
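
A minimal sketch of the same behavior on top of the built-in logging module (the logger and file names here are assumptions): the FileHandler keeps the file open for the life of the program, and the StreamHandler covers the print:

import logging

logger = logging.getLogger("proj")
logger.setLevel(logging.INFO)

# Mirror the original's semicolon-separated line format.
formatter = logging.Formatter(
    "%(asctime)s; PROJ; %(levelname)s; %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)

file_handler = logging.FileHandler("proj.log")  # opened once, kept open
file_handler.setFormatter(formatter)
logger.addHandler(file_handler)

console = logging.StreamHandler()  # replaces the print()
console.setFormatter(formatter)
logger.addHandler(console)

logger.info("something happened")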

Finally, what file_io library would be complete without a way to call shell commands?

def run_command_window(cmd, log_path=None):
    std_output = []

    updated_env = {**os.environ, "PYTHONPATH": str(config.repos_dir)}

    with Popen(
        cmd,
        stdout=PIPE,
        bufsize=1,
        universal_newlines=True,
        shell=sys.platform == "linux",
        env=updated_env,
    ) as p:
        for line in p.stdout:
            print(line, end="", flush=True)  # process line here
            std_output.append(line)

            if log_path:
                with open(log_path, "a") as log:
                    log.write(line)

    return p.returncode, std_output

Broadly speaking, if you're trying to automate things via Python, you might come up with something very much like this. But I draw your attention to their updated_env, which populates a PYTHONPATH variable: they're running their own Python code by shelling out and launching a second interpreter.
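
When the command is your own Python code, the usual fix is to not spawn a process at all. A sketch, where the module and entry-point names are hypothetical:

import sys

sys.path.insert(0, str(config.repos_dir))  # the same config as above; this is what PYTHONPATH was doing

# Hypothetical module and entry point: import and call in-process,
# with real arguments, return values, and exceptions.
from pipeline_job import main

result = main()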

Sandra writes:

This is one of the better files in the project.
Thankfully, this one has a happy ending: management is well aware that this project is legacy garbage, and has authorized a full-scale rework to make it maintainable. Whether the guy doing the work escapes with his sanity intact is another question.

Honestly, as "I don't actually know Python but I'm very smart" code goes, this isn't so bad. I mean, it's terrible, but it's clearly written. Even the parts that shouldn't have been written are clean and easy to understand. That, I suppose, just makes it worse, doesn't it? Picking through the code that needs to be thrown away will be harder than one expects, because you can't just go "kill it with fire"; you have to think about what you're throwing away.


This post originally appeared on The Daily WTF.
