
Semantic Search Using Text Embeddings (With ChatGPT + Python)


    Traditional keyword-based search matches only the literal words in a query, and that is no longer enough. Today, users expect search engines to understand the meaning behind their queries, not just the words.

    That’s where semantic search comes in. It’s a technique that uses AI-generated embeddings to understand and compare the meanings of words, phrases, or even entire documents.


    🔍 Why Semantic Search?

    • Finds similar content even when keywords differ

    • Understands natural language queries

    • Great for product search, recommendation systems, document retrieval, and more


    🛠 Tools Used

    • OpenAI API (embeddings endpoint) – for generating embeddings

    • Python – our programming language

    • Pandas – for data manipulation

    • NumPy – for vector math

    • tiktoken – to estimate token usage and cost


    📊 Step-by-Step Guide to Semantic Search

    1. Generate Embeddings
    Each word or phrase is converted into a high-dimensional vector using an OpenAI model:

    df['embedding'] = df['Words'].apply(lambda x: get_embedding(x))
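
    The call above assumes a get_embedding helper, which hasn't been defined yet. Here is a minimal sketch of one using the openai Python SDK (v1-style client; the model name matches the cost estimate in step 5):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def get_embedding(text, model='text-embedding-3-small'):
        # One API request per text; returns the embedding as a list of floats
        response = client.embeddings.create(input=text, model=model)
        return response.data[0].embedding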

    2. Define Cosine Similarity Function
    This function compares vectors to measure how similar their meanings are:

    import numpy as np

    def cosine_similarity(vec1, vec2):
        vec1 = np.array(vec1).flatten()
        vec2 = np.array(vec2).flatten()
        if vec1.shape != vec2.shape:
            raise ValueError("Vectors must be of same dimension")
        dot_product = np.dot(vec1, vec2)
        norm_a = np.linalg.norm(vec1)
        norm_b = np.linalg.norm(vec2)
        if norm_a == 0 or norm_b == 0:
            raise ValueError("Cosine similarity is not defined for zero vectors")
        return dot_product / (norm_a * norm_b)
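
    As a quick sanity check with toy 2D vectors (not real embeddings): vectors pointing the same way score 1.0, and orthogonal vectors score 0.0.

    print(cosine_similarity([1, 0], [2, 0]))  # 1.0: same direction, magnitude ignored
    print(cosine_similarity([1, 0], [0, 1]))  # 0.0: orthogonal vectors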

    3. Combine Concepts
    Want to find words similar to both “Sunflower” and “Lotus”? Adding their embedding vectors gives a single direction that blends both meanings:

    v1 = np.array(df['embedding'].iloc[7])  # Sunflower
    v2 = np.array(df['embedding'].iloc[1])  # Lotus
    v = v1 + v2  # element-wise addition; plain Python lists would concatenate instead
    df['similarities'] = df['embedding'].apply(lambda x: cosine_similarity(x, v))
    top_matches = df.sort_values('similarities', ascending=False).head(10)

    4. Semantic Search from User Query

    search_term = "yellow flower"
    search_term_vector = get_embedding(search_term)
    df['similarities'] = df['embedding'].apply(lambda x: cosine_similarity(x, search_term_vector))
    results = df.sort_values('similarities', ascending=False).head(10)

    5. Estimate Token Cost

    import tiktoken

    words = list(df['Words'])
    enc = tiktoken.encoding_for_model('text-embedding-3-small')
    total_tokens = sum(len(enc.encode(word)) for word in words)
    cost_per_token = 0.02 / 1_000_000  # $0.02 per 1M tokens for text-embedding-3-small
    estimated_cost = total_tokens * cost_per_token
    print(f"Total tokens: {total_tokens}")
    print(f"Estimated cost: ${estimated_cost:.10f}")

    🧭 Visual Overview

    Imagine each word as a dot in space. Embeddings place similar concepts closer together.
    When we calculate cosine similarity, we are measuring how “aligned” these meaning-vectors are.
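
    If you want to draw this picture yourself, here is a minimal sketch (assuming the df from the steps above, plus matplotlib and scikit-learn) that projects the high-dimensional embeddings down to 2D with PCA:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    # Stack all embeddings into one matrix, then reduce to two dimensions
    emb = np.vstack(df['embedding'].to_numpy())
    coords = PCA(n_components=2).fit_transform(emb)

    plt.scatter(coords[:, 0], coords[:, 1])
    for word, (x, y) in zip(df['Words'], coords):
        plt.annotate(word, (x, y))
    plt.title('Words as dots: similar meanings sit closer together')
    plt.show()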



    🚀 Conclusion

    Semantic search is a game-changer for modern applications. Instead of matching exact words, you’re matching meaning — thanks to the power of embeddings and vector math.

    Whether you're building an internal search engine, recommendation system, or a chatbot — this technique can dramatically improve user experience.


    💡 Pro Tips

    • Use text-embedding-3-large for better accuracy

    • For scale, use a vector index library like FAISS, or a vector database like Pinecone or Weaviate

    • Normalize embeddings to unit length: cosine similarity then reduces to a plain dot product, which is much cheaper to compute at scale (see the sketch below)
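
    On that last tip: once every vector has unit length, ranking the whole table becomes a single matrix-vector multiply. A minimal NumPy sketch, assuming the df and get_embedding from the steps above:

    import numpy as np

    # Normalize every stored embedding to unit length (do this once, up front)
    emb = np.vstack(df['embedding'].to_numpy())
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)

    # Normalize the query the same way, then rank by one dot product per row
    query = np.array(get_embedding('yellow flower'))
    query = query / np.linalg.norm(query)
    scores = emb @ query
    top10 = df['Words'].iloc[np.argsort(scores)[::-1][:10]]
    print(top10)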