Category Classification: ChatGPT 3.5 vs. Embeddings
Currently, there are two obvious ways to classify text with OpenAI models. GPT-3.5-turbo (ChatGPT) is probably the most popular choice, but OpenAI recommends the Embeddings API for this purpose:
OpenAI's text embeddings measure the relatedness of text strings. Embeddings are commonly used for:
- Search (where results are ranked by relevance to a query string)
- Clustering (where text strings are grouped by similarity)
- Recommendations (where items with related text strings are recommended)
- Anomaly detection (where outliers with little relatedness are identified)
- Diversity measurement (where similarity distributions are analyzed)
- Classification (where text strings are classified by their most similar label)
At Backlink.nl, we want to categorize more than 3,000 websites into sixty categories. Since it is not clear which of the two approaches is more effective, both were tested using Python. The code used for this, as well as an analysis of the results, can be found below.
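For context, the embeddings-based approach boils down to embedding each category label and the text of a page, then picking the categories whose embeddings are most similar to the page's embedding. Below is a minimal sketch of that idea, assuming the text-embedding-ada-002 model and the pre-1.0 openai Python library used elsewhere in this post; the helper names are illustrative, not the exact test script.

import numpy as np
import openai

openai.api_key = "MYWONDERFULAPIKEY"

def embed(texts):
    # Request embeddings for a list of strings (assumed model: text-embedding-ada-002)
    response = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return [np.array(item["embedding"]) for item in response["data"]]

def top_categories_by_embedding(text, categories, n=3):
    # Rank categories by cosine similarity between the page text and each category label
    category_vectors = embed(categories)
    text_vector = embed([text])[0]
    similarities = [
        float(np.dot(text_vector, vec) / (np.linalg.norm(text_vector) * np.linalg.norm(vec)))
        for vec in category_vectors
    ]
    ranked = sorted(zip(categories, similarities), key=lambda pair: pair[1], reverse=True)
    return [category for category, _ in ranked[:n]]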
Code (text-davinci-003)
Later edits:
- fix for 404s
- fixed columns
- fix for non-www/www
import pandas as pd
import requests
from bs4 import BeautifulSoup
import openai
import csv

# Load the data from the CSV files
cat_df = pd.read_csv("cat.csv")
products_df = pd.read_csv("products.csv")

# Read the categories from the first column (index 0) of cat.csv
categories = cat_df.iloc[:, 0].tolist()

# Set up OpenAI API
openai.api_key = "MYWONDERFULAPIKEY"

# Function to categorize a URL based on the first 200 characters of page text
def categorize_url(url):
    try:
        # Fetch the page with a 10-second timeout
        response = requests.get(url, timeout=10)
    except requests.exceptions.RequestException:
        # If the first attempt fails, try with the 'www' prefix or without it
        try:
            if "www." in url:
                url = url.replace("www.", "")
            else:
                url = url.replace("://", "://www.")
            response = requests.get(url, timeout=10)
        except requests.exceptions.RequestException as e:
            print(f"Error processing URL '{url}': {e}")
            return []

    # Check if the URL is reachable
    if response.status_code != 200:
        return []

    # Extract the visible text and keep the first 200 characters
    soup = BeautifulSoup(response.content, "html.parser")
    text = " ".join(soup.stripped_strings)[:200]

    # Perform an OpenAI request to categorize the URL based on that text
    response = openai.Completion.create(
        engine="text-davinci-003",
        prompt=f"Categorize the text strictly based on the given categories: Categories: {', '.join(categories)}\n Text: {text}\n Top 3 categories based on this text are:",
        temperature=0.5,
        max_tokens=50,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
    )

    # Extract the top categories from the response
    top_categories = response.choices[0].text.strip().split(", ")
    return top_categories

# Write the output to output.csv
with open("output.csv", "w", newline="") as csvfile:
    csvwriter = csv.writer(csvfile)
    csvwriter.writerow(["url", "cat1", "cat2", "cat3", "cat4", "cat5"])

    # Loop through the first 100 rows of products.csv
    for i in range(100):
        product = products_df.iloc[i]
        # Get the URL from the second column (index 1)
        url = product.iloc[1]
        # Categorize the URL
        top_categories = categorize_url(url)
        # Prepare the output, padding to five category columns
        output = [url] + top_categories + [""] * (5 - len(top_categories))
        # Write the output to the CSV file
        csvwriter.writerow(output)
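For the gpt-3.5-turbo side of the comparison, the same prompt can be sent through the chat endpoint instead of the completions endpoint. The following is a hedged sketch of that variant (reusing the categories list and scraped text from the script above, with the same pre-1.0 openai library); it is not necessarily the exact code that produced the results.

import openai

def categorize_text_chat(text, categories):
    # Same prompt as above, but sent to the chat endpoint (gpt-3.5-turbo)
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {
                "role": "user",
                "content": f"Categorize the text strictly based on the given categories: "
                           f"Categories: {', '.join(categories)}\n Text: {text}\n "
                           f"Top 3 categories based on this text are:",
            }
        ],
        temperature=0.5,
        max_tokens=50,
    )
    return response.choices[0].message.content.strip().split(", ")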
Fix for categories invented by the model
Especially with models older than GPT-4, it is hard to prevent the output from containing categories that were not in the provided list. Even with this script, despite trying different prompts stating that only categories from the list may be chosen, the model picks a category that is not on the list 1-2% of the time.
Script for cleaning rows with duplicate entries:
import csv

with open('outputdirty.csv', 'r') as infile, open('outputclean.csv', 'w', newline='') as outfile:
    reader = csv.reader(infile, delimiter=';')
    writer = csv.writer(outfile, delimiter=';')
    for row in reader:
        # Keep only the first occurrence of each entry in the row
        cleaned_entries = []
        for entry in row:
            if entry not in cleaned_entries:
                cleaned_entries.append(entry)
        writer.writerow(cleaned_entries)
Endlessly tweaking the prompt might help prevent this, but we now ask for 3 to 5 categories (the model almost always returns 5), of which the first 3 are the most relevant. It is therefore easier to compare the output afterwards with the categories in the list and remove every cell containing a category that does not appear in that list. This was done with the following script (it could also have been incorporated into the main script):
import csv

# Read in the categories from cat.csv
with open('cat.csv', 'r') as cat_file:
    cat_reader = csv.reader(cat_file)
    categories = set([row[0] for row in cat_reader])

# Iterate through outputclean.csv and filter out unwanted categories
with open('outputclean.csv', 'r', newline='') as input_file, open('outputfinal.csv', 'w', newline='') as output_file:
    reader = csv.reader(input_file, delimiter=';')
    writer = csv.writer(output_file, delimiter=';')
    for row in reader:
        url = row[0]
        row_categories = row[1:]
        # Keep only entries that match a category from the list, preserving their order
        filtered_categories = [c for c in row_categories if any(c == category or c.startswith(category + ';') for category in categories)]
        # Write the row to outputfinal.csv
        writer.writerow([url] + filtered_categories)