
Semantic Search Using Text Embeddings (With ChatGPT + Python)

Traditional keyword-based search is showing its age. Today, users expect search engines to understand the meaning behind their queries, not just match the words.

That’s where semantic search comes in. It’s a technique that uses AI-generated embeddings to understand and compare the meanings of words, phrases, or even entire documents.


🔍 Why Semantic Search?

  • Finds similar content even when keywords differ

  • Understands natural language queries

  • Great for product search, recommendation systems, document retrieval, and more


🛠 Tools Used

  • OpenAI API (the platform behind ChatGPT) – for generating embeddings

  • Python – our programming language

  • Pandas – for data manipulation

  • NumPy – for vector math

  • tiktoken – to estimate token usage and cost


📊 Step-by-Step Guide to Semantic Search

1. Generate Embeddings
Each word or phrase is converted into a high-dimensional vector using an OpenAI model:

# df is a DataFrame with one word or phrase per row in its 'Words' column
df['embedding'] = df['Words'].apply(get_embedding)
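
This relies on a get_embedding helper that is not defined above. A minimal sketch of one, assuming the official openai Python SDK (v1+) and the same text-embedding-3-small model priced in the cost estimate below:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def get_embedding(text, model="text-embedding-3-small"):
    """Return the embedding vector (a list of floats) for a piece of text."""
    response = client.embeddings.create(input=text, model=model)
    return response.data[0].embedding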

2. Define Cosine Similarity Function
This function compares vectors to measure how similar their meanings are:

import numpy as np

def cosine_similarity(vec1, vec2):
    vec1 = np.array(vec1).flatten()
    vec2 = np.array(vec2).flatten()
    if vec1.shape != vec2.shape:
        raise ValueError("Vectors must be of same dimension")
    dot_product = np.dot(vec1, vec2)
    norm_a = np.linalg.norm(vec1)
    norm_b = np.linalg.norm(vec2)
    if norm_a == 0 or norm_b == 0:
        raise ValueError("Cosine similarity is not defined for zero vectors")
    return dot_product / (norm_a * norm_b)
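
A quick sanity check with toy vectors confirms the expected behaviour: parallel vectors score 1.0 and orthogonal vectors score 0.0:

print(cosine_similarity([1, 0], [2, 0]))  # 1.0 (same direction)
print(cosine_similarity([1, 0], [0, 1]))  # 0.0 (unrelated directions)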

3. Combine Concepts
Want to find words similar to both “Sunflower” and “Lotus”? Adding their embedding vectors produces a combined meaning to search against:

v1 = df['embedding'].iloc[7]  # Sunflower
v2 = df['embedding'].iloc[1]  # Lotus
v = np.array(v1) + np.array(v2)  # element-wise sum; a plain list + list would concatenate instead
df['similarities'] = df['embedding'].apply(lambda x: cosine_similarity(x, v))
top_matches = df.sort_values('similarities', ascending=False).head(10)

4. Semantic Search from User Query
Embed the user’s query with the same model, then rank every row by similarity to the query vector:

search_term = "yellow flower"
search_term_vector = get_embedding(search_term)
df['similarities'] = df['embedding'].apply(lambda x: cosine_similarity(x, search_term_vector))
results = df.sort_values('similarities', ascending=False).head(10)

5. Estimate Token Cost
Before embedding a large dataset, use tiktoken to count tokens and estimate the API bill up front:

import tiktoken

words = list(df['Words'])
enc = tiktoken.encoding_for_model('text-embedding-3-small')
total_tokens = sum(len(enc.encode(word)) for word in words)
cost_per_token = 0.02 / 1_000_000
estimated_cost = total_tokens * cost_per_token
print(f"Total tokens: {total_tokens}")
print(f"Estimated cost: ${estimated_cost:.10f}")

🧭 Visual Overview

Imagine each word as a dot in space. Embeddings place similar concepts closer together.
When we calculate cosine similarity, we are measuring how “aligned” these meaning-vectors are.
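
One way to actually see this picture, as a minimal sketch assuming scikit-learn and matplotlib are installed (this plotting step is illustrative, not part of the walkthrough above): project the high-dimensional embeddings down to 2D with PCA and plot each word as a labelled dot.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Flatten the embedding column into an (n_words, n_dims) array,
# then project it down to 2 dimensions for plotting
vectors = np.array(df['embedding'].tolist())
points = PCA(n_components=2).fit_transform(vectors)

plt.scatter(points[:, 0], points[:, 1])
for word, (x, y) in zip(df['Words'], points):
    plt.annotate(word, (x, y))
plt.title("Word embeddings projected to 2D")
plt.show()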



🚀 Conclusion

Semantic search is a game-changer for modern applications. Instead of matching exact words, you’re matching meaning — thanks to the power of embeddings and vector math.

Whether you're building an internal search engine, recommendation system, or a chatbot — this technique can dramatically improve user experience.


💡 Pro Tips

  • Use text-embedding-3-large for higher-quality embeddings (at a higher price per token)

  • Use a vector index or database such as FAISS, Pinecone, or Weaviate once brute-force scanning over a DataFrame becomes too slow (see the sketch below)

  • Normalize vectors to unit length so cosine similarity reduces to a plain dot product, which is cheaper to compute at scale
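
To illustrate the last two tips together, here is a minimal FAISS sketch, assuming faiss-cpu is installed and reusing the df and get_embedding from the walkthrough above: normalizing the vectors first means an inner-product index returns exact cosine-similarity rankings.

import faiss  # pip install faiss-cpu
import numpy as np

# Build an index over the embeddings; on unit-length vectors,
# inner product equals cosine similarity
vectors = np.array(df['embedding'].tolist()).astype('float32')
faiss.normalize_L2(vectors)
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)

# Query it the same way as the pandas version above
query = np.array([get_embedding("yellow flower")], dtype='float32')
faiss.normalize_L2(query)
scores, ids = index.search(query, 10)  # top-10 matches
print(df['Words'].iloc[ids[0]])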