Bandcamp scraper generates database links for online publication
The code in this article can be found here: https://github.com/pfeiffer3000/History-Database-Creator
Each week I archive my drum & bass sets from Hack The Planet. That involves uploading an audio recording to Soundcloud, generating an HTML table of all the songs I played, adding AI images generated during the show, and fussing with WordPress behind-the-scenes things.
Recently (June 2024), WordPress added AI functionality to help write blog posts. They offer AI text generation that synthesizes your post for distribution purposes, generates AI images based on the post's content, or gives you helpful feedback on the structure, content, or layout of the post. Usually, I have software that helps me generate my content, but it's generated from copy that I wrote and is based on the data I generated during the show. I don't need the AI's help, because it would probably just muck up my flow and get in the way. However, I decided to see what it would say…
It suggested that I add more links to songs, artists, and labels in the post’s content. I didn’t expect the AI to generate a good idea like that! So, I took the AI’s suggestion and started building a Python program that would generate a database that had just the track title, artist, album, label, and Bandcamp URL for each song that I play, or have played in the past. I felt a little silly taking advice from an AI, which was a first for me, but it was such a good idea. Well played, WordPress AI!
I used Bandcamp because their search is kind of tough to navigate. It's unforgiving about surfacing results that are merely adjacent or nearby to the search terms. In this case, that strictness worked wonderfully. I wanted to search for a specific match to the artist's name and track title. If Bandcamp returns a result at all, then the first one is probably the right track. Otherwise, it comes up with no results, and we can assume that the track doesn't exist on Bandcamp. I know that's not necessarily true, but this method allowed me to find links to a majority of my dnb collection. And if we can support a few people's music, then it's better than supporting none!
The Python program below assumes that playlist histories are stored as .txt files. I use Rekordbox to keep track of my playlists, so the code shows the specifics of reading a Pioneer DJ playlist file, notably the encoding="utf-16" argument used while reading the playlist history file.
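As a rough sketch of that parsing step, here's what happens to one tab-separated row from a history file. The sample line and its column order are made up to match what the program below expects (columns 2 through 5 hold title, artist, album, and label):

```python
# Hypothetical Rekordbox history row: columns 0-1 are index/artwork,
# then track title, artist, album, label (order assumed from the program below).
sample_line = "1\t\tValley of the Shadows\tOrigin Unknown\tThe Speed of Sound\tRAM Records\n"

fields = sample_line.strip().split("\t")
track = {
    "track title": fields[2],
    "artist": fields[3],
    "album": fields[4],
    "label": fields[5],
    "itemurl": "",  # filled in later by the Bandcamp search
}
print(track["artist"])  # Origin Unknown
```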
The program loads a playlist, then loads the database (which is stored as a JSON file), then checks whether the playlist tracks are already in the database. If they are not, the program scrapes Bandcamp with a GET request, BeautifulSoup's the crap out of it, finds the link to the first result of the search query, then saves the new tracks to the database. I currently have this running on my past playlist histories (over 500 playlists), and it has already generated more than 5500 links for over 9000 songs! As I watch the links printed to the screen, I can see that a lot of them are right. This is totally anecdotal, but pretty darn good so far! It's also taken a couple of hours to get this far.
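The scraping step boils down to grabbing the first search result's link. Here's a minimal sketch against a hand-written HTML fragment; the real Bandcamp page's markup may differ, but the `itemurl` class is what the program below targets:

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a Bandcamp search results page (structure assumed).
sample_html = """
<ul class="result-items">
  <li class="searchresult">
    <div class="itemurl"><a href="https://example.bandcamp.com/track/foo">link</a></div>
  </li>
  <li class="searchresult">
    <div class="itemurl"><a href="https://other.bandcamp.com/track/bar">link</a></div>
  </li>
</ul>
"""

soup = BeautifulSoup(sample_html, "html.parser")
first = soup.find(class_="itemurl")  # find() returns the first match in document order
link = first.a.get("href") if first else ""
print(link)  # https://example.bandcamp.com/track/foo
```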
Once the database has finished populating, I’ll be able to check new playlists against the current entries and only add the new tracks. I do that in another program that I mentioned earlier about helping with the copy for Soundcloud, WordPress, dnbradio.com, Pacifica uploads, Twitch and YouTube data.
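That "only add the new tracks" check is just a membership test on the (title, artist) pair. A minimal sketch with illustrative data (the track names here are made up):

```python
# A track counts as already known if its (title, artist) pair is in the database.
database = [
    {"track title": "Circles", "artist": "Adam F"},
]
new_playlist = [
    {"track title": "Circles", "artist": "Adam F"},      # already known
    {"track title": "Metropolis", "artist": "Adam F"},   # new
]

seen = {(t["track title"], t["artist"]) for t in database}
to_add = [t for t in new_playlist if (t["track title"], t["artist"]) not in seen]
print([t["track title"] for t in to_add])  # ['Metropolis']
```

Building the `seen` set once keeps the lookup fast even when the database holds thousands of tracks.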
It’s an odd feeling to take advice from an AI, especially when it’s good advice. Btw, I used the AI image gen from WordPress as the image for this project–that kind of felt like the right thing to do.
History_database_creator.py
import os
import requests
from bs4 import BeautifulSoup
import json
class HistoryDB:
    def __init__(self, history_file_name):
        self.history_location = ""  # put the path to the HISTORY files output by Rekordbox in the quotes
        self.link_database_name = "link_database.json"  # name of the database file
        self.history_file = os.path.join(self.history_location, history_file_name)
    def find_most_recent_history(self):
        # find the most recent HISTORY file in the history_location
        files = os.listdir(self.history_location)
        files = [os.path.join(self.history_location, file) for file in files]
        most_recent_file = max(files, key=os.path.getmtime)
        print(f"Most recent HISTORY file: {most_recent_file}")
        return most_recent_file
    def generate_track_list(self):
        # open the Rekordbox playlist history file and read the contents
        with open(self.history_file, 'rt', newline="\n", encoding="utf-16") as file:
            lines = file.readlines()
        lines = [line.strip() for line in lines]
        self.track_list = []
        for line in lines:
            track_info = line.split("\t")
            track_dict = {
                "track title": track_info[2],
                "artist": track_info[3],
                "album": track_info[4],
                "label": track_info[5],
                "itemurl": ""
            }
            self.track_list.append(track_dict)
        print("Track list generated!")
    def load_database(self):
        # load the link database file. This should have all the previous Bandcamp links
        with open(self.link_database_name, "r") as fin:
            self.link_database = json.load(fin)
        print("Database loaded!")
    def find_bandcamp_links(self):
        # search for Bandcamp links for each track in the track_list unless it already exists in the link_database
        print("Finding Bandcamp links")
        print()
        for track in self.track_list[1:]:  # skip the header row from the history file
            # check the link_database for previous entries
            litemurl = next(
                (ltrack['itemurl'] for ltrack in self.link_database
                 if ltrack['track title'] == track['track title']
                 and ltrack['artist'] == track['artist']),
                False
            )
            # if the track is already in the database, use the existing link
            # (an empty string is a valid entry, meaning "searched before, no link")
            if litemurl is not False:
                track['itemurl'] = litemurl
            # if the track is not in the database, search for a new link
            else:
                print()
                print(f"Searching: {track['artist']} - {track['track title']}")
                # let requests handle the URL encoding of the query string
                results = requests.get(
                    "https://bandcamp.com/search",
                    params={"q": f"{track['artist']} {track['track title']}"}
                )
                soup = BeautifulSoup(results.text, "html.parser")
                try:
                    track['itemurl'] = soup.find(class_="itemurl").a.get("href")
                    print(f"    Bandcamp link found! --- {track['itemurl'][0:24]}...")
                except AttributeError:  # find() returned None: no search results
                    track['itemurl'] = ""
                    print("    (No link found)")
                self.link_database.append(track)
        print("Finished searching Bandcamp links.")
    def create_database(self):
        # create a blank database without overwriting an existing file
        while os.path.exists(self.link_database_name):
            self.link_database_name = self.link_database_name.split(".")[0] + "_new.json"
        with open(self.link_database_name, "w") as fout:
            json.dump(self.track_list, fout)
        print(f"Database created: {self.link_database_name}")
    def update_database(self):
        # update the link_database.json file with the new links
        with open(self.link_database_name, "w") as fout:
            json.dump(self.link_database, fout)
        print("Database updated!")
    def generate_html_table(self):
        # generate an HTML table of the currently loaded track_list from the history playlist
        html_table = "<table>"
        html_table += "<tr><th>Artist</th><th>Track Title</th><th>Label</th><th>Link</th></tr>"
        for track in self.track_list[1:]:
            if not track['itemurl']:  # empty string means no Bandcamp link was found
                html_table += f"<tr><td>{track['artist']}</td><td>{track['track title']}</td><td>{track['label']}</td><td> </td></tr>"
            else:
                html_table += f"<tr><td>{track['artist']}</td><td>{track['track title']}</td><td>{track['label']}</td><td><a href='{track['itemurl']}'>Bandcamp Link</a></td></tr>"
        html_table += "</table>"
        with open("track_list_html_table.html", "w") as file:
            file.write(html_table)
        print("HTML table generated!")
if __name__ == "__main__":
    # keep prompting until a file name is actually entered
    history_file_name = ""
    while history_file_name == "":
        history_file_name = input("Enter the name of the HISTORY file: ")
    hdb = HistoryDB(history_file_name)
    hdb.generate_track_list()
    hdb.load_database()
    hdb.find_bandcamp_links()
    hdb.update_database()