Conversation
WanderingStar
left a comment
There was a problem hiding this comment.
Overall comments:
- Good division of work into bite-sized functions
- Unfortunately, many of the functions are single-purpose because they hard-code application logic in them. Think about whether you can factor things like the names of the directories out to make more generally useful functions that you could reuse elsewhere
Most of the things I comment on below are stylistic or for greater intelligibility of the code. However the use of \ for directory separators means that this code doesn't work on non-Windows OSes (and fails in a way that gives the runner very little information about what went wrong)
|
|
||
| def verify_dirs_exist(): | ||
| #This notebook requires a few directories | ||
| dirs = ["download", "download\csv", "download\excel"] |
There was a problem hiding this comment.
Are these directories intended to be literally named download\csv and download\excel? Or are csv and excel intended to be subdirectories under download? On non-Windows OSes, \ is not the directory separator character, so this does the former. You can use os.path.join('download', 'csv') to work cross-platform.
If the excel and csv directories are meant to be subdirectories of download, you don't need to make download separately. os.makedirs operates recursively.
| full_path = os.path.join(curpath, d) # join cwd with proposed d | ||
| create_dir_if_not_exists(full_path) | ||
|
|
||
| def create_dir_if_not_exists(full_path): |
There was a problem hiding this comment.
os.makedirs already does the file existence check, essentially.
try:
os.makedirs(full_path)
print(f"Created directory {full_path}")
except FileExistsError:
print(f"Directory {full_path} already existed")
The try/except version is slightly preferable because in the if version, there's a tiny chance that someone could create the directory between when you do the if and when you do the makedirs.
| import glob | ||
| import datetime as dt | ||
|
|
||
| def verify_dirs_exist(): |
There was a problem hiding this comment.
The naming of this function is a bit misleading. It doesn't just verify that they exist, it actually creates them if they don't exist. I think it should be called ensure_dirs_exist.
There was a problem hiding this comment.
I would also make this function take in the list of directories to create, so that it's re-usable.
| else: | ||
| print("Directory ", full_path, " already existed") | ||
|
|
||
| def generate_file_name_from_url(url): |
There was a problem hiding this comment.
A comment with a sample URL and the file name you're generating from it would make this much easier to understand.
| print("Directory ", full_path, " already existed") | ||
|
|
||
| def generate_file_name_from_url(url): | ||
| month_year = url.split("\\")[-1].split("_")[-1].split(".")[0] |
There was a problem hiding this comment.
Does url.split("\\") do anything for these URLs? It looks like a sample url is https://www.huduser.gov/portal/datasets/usps/ZIP_COUNTY_032010.xlsx, which doesn't contain the \ character.
There was a problem hiding this comment.
This is one of the (rare?) cases where a regular expression might make things clearer...
m = re.search('ZIP_COUNTY_(\d\d)(\d\d\d\d)\.xlsx', url)
month, year = m.groups()
| #open and get the excel dataframe | ||
| excel_df = process_excel_file(full_file_path) | ||
| #merge the excel file with the fips data | ||
| merged_df = fips_df.merge(excel_df) |
There was a problem hiding this comment.
It would be better if this function did not refer to fits_df, which is defined outside of the function. That should be passed in as an argument or defined within the function.
There was a problem hiding this comment.
The same is true for cur_year
| print(csv_path) | ||
| try: | ||
| merged_df.to_csv(csv_path, encoding='utf-8', index=False) | ||
| except: |
There was a problem hiding this comment.
This except is too broad. When I first tried to run this, the code did nothing because it was trying to write to a directory that didn't exist. It didn't show me the error because this clause swallowed it. If you're expecting a particular error, you should catch it as specifically as possible.
| merged_df.to_csv(csv_path, encoding='utf-8', index=False) | ||
| except: | ||
| #once we get to a Q that hasn't happened yet, we'll get an XLDRerror | ||
| print("Operation has completed") |
There was a problem hiding this comment.
Which operation has completed? This message could be more descriptive. "All files have been downloaded and processed"?
| try: | ||
| merged_df.to_csv(csv_path, encoding='utf-8', index=False) | ||
| except: | ||
| #once we get to a Q that hasn't happened yet, we'll get an XLDRerror |
There was a problem hiding this comment.
Is there a more straightforward way to detect this? Check to see if the input is empty? Check if year > now.year or year == now.year and month > now.month?
| import os | ||
| import time | ||
| import pandas as pd | ||
| import glob |
There was a problem hiding this comment.
I don't think you use glob in this code.
|
Oh, one thing I forgot to mention. Your requirements.txt needs |
WonderingStar to perform code review & LearningSomethingNew to update code on feedback.