Lessons from Building AI-Powered Recruitment Tools at Scale
What I learned building Recruiter Copilot and Talent Copilot at HRFLOW.AI — practical AI integration, candidate scoring, and the gap between AI demos and production systems.
At HRFLOW.AI, I built AI-powered tools that helped recruiters at 850+ companies identify and match candidates. Here's an honest breakdown of what works, what doesn't, and the gap between AI prototypes and production systems.
The Problem Space
Recruitment is fundamentally an information retrieval problem with human judgment layered on top. A recruiter for a senior React role receives 200 CVs. Most are unqualified. The manual review takes 3-4 hours and is prone to bias and fatigue.
AI can compress that to 20 minutes — if the system is built correctly.
CV Parsing: Harder Than It Looks
The first challenge: extracting structured data from documents that come in every format imaginable — PDF, DOCX, image scans, poorly formatted HTML exports from LinkedIn, and 15-year-old Word documents.
The architecture that worked:
class CVParser:
    def __init__(self):
        self.extractors = {
            "pdf": PDFExtractor(),
            "docx": DocxExtractor(),
            "image": OCRExtractor(),  # Tesseract + preprocessing
        }
        self.normalizer = LLMNormalizer()  # GPT-4 for structure extraction

    async def parse(self, file: bytes, mime_type: str) -> ParsedCV:
        # Step 1: Extract raw text
        raw_text = await self.extractors[mime_type].extract(file)

        # Step 2: Normalize with LLM
        structured = await self.normalizer.extract(raw_text)

        # Step 3: Post-process and validate
        return self.validate(structured)

The key insight: use LLMs for structure extraction, not raw text extraction. Regex and rule-based parsers fail on the long tail of CV formats. An LLM with a well-crafted prompt handles ambiguity gracefully.
The prompt that worked:
EXTRACTION_PROMPT = """
Extract structured information from this CV. Return valid JSON matching this schema:
{schema}

CV text:
{cv_text}

Rules:
- If a field is not found, return null (not an empty string)
- Normalize dates to ISO format (YYYY-MM or YYYY)
- Extract skills as they appear, don't infer unlisted skills
- If company name is ambiguous, include the full text
"""

Critical: don't infer unlisted skills — without this, the model hallucinates skills that sound plausible but aren't in the CV.
Candidate Scoring
Scoring candidates against a job description requires understanding semantic similarity, not keyword matching.
We used a bi-encoder approach (sentence-transformers) for initial ranking, then a cross-encoder for the top-N candidates:
from sentence_transformers import SentenceTransformer, CrossEncoder
from sklearn.metrics.pairwise import cosine_similarity

bi_encoder = SentenceTransformer("BAAI/bge-large-en-v1.5")
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniL M-L-6-v2".replace(" ", ""))

async def rank_candidates(
    job_description: str,
    candidates: list[Candidate],
    top_k: int = 20,
) -> list[ScoredCandidate]:
    # Step 1: Embed job and all candidates (fast, parallel)
    job_embedding = bi_encoder.encode(job_description)
    candidate_texts = [c.to_text() for c in candidates]
    candidate_embeddings = bi_encoder.encode(candidate_texts, batch_size=64)

    # Step 2: Cosine similarity — fast initial filter down to 3 * top_k
    similarities = cosine_similarity([job_embedding], candidate_embeddings)[0]
    top_indices = similarities.argsort()[-top_k * 3:][::-1]

    # Step 3: Cross-encoder rerank on that subset (more accurate, slower)
    pairs = [[job_description, candidate_texts[i]] for i in top_indices]
    rerank_scores = cross_encoder.predict(pairs)

    # Step 4: Combine and return top_k
    scored = sorted(
        zip(top_indices, rerank_scores),
        key=lambda x: x[1],
        reverse=True,
    )[:top_k]
    return [ScoredCandidate(candidate=candidates[i], score=s) for i, s in scored]

This two-stage approach is 10x faster than running the cross-encoder on all candidates while maintaining near-cross-encoder accuracy on the final rankings.
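Calling it is straightforward; jd_text and pool below are placeholders for a real job description and a parsed candidate pool:

import asyncio

async def main() -> None:
    ranked = await rank_candidates(jd_text, pool, top_k=20)
    for sc in ranked[:5]:
        print(f"{sc.score:.3f}  {sc.candidate.name}")

asyncio.run(main())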
The Explainability Problem
Recruiters don't trust black boxes. "The AI ranked this candidate 3rd" is useless without knowing why.
I added an explanation layer that generates human-readable reasoning for each ranking:
async def explain_match(job: Job, candidate: Candidate, score: float) -> str:
    prompt = f"""
Job: {job.title} at {job.company}
Requirements: {job.requirements}

Candidate: {candidate.name}
Background: {candidate.summary}
Skills: {', '.join(candidate.skills)}

Match score: {score:.0%}

Write 2-3 sentences explaining why this candidate is or isn't a strong match.
Be specific. Reference actual skills and requirements from the job.
"""
    return await llm.complete(prompt, max_tokens=150)

Recruiters consistently said the explanations were more valuable than the scores.
Production Reality
Latency matters more than accuracy — a 95% accurate model that takes 30 seconds per CV is worse than a 90% accurate model that takes 2 seconds. Recruiters are impatient.
Batch everything — parsing 200 CVs serially is slow. We built an async job queue (Celery + Redis) that parallelized parsing across workers. 200 CVs went from 4 minutes to 35 seconds.
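The shape of that pipeline, as a sketch rather than our exact code (cv_batch is a placeholder for the uploaded files, CVParser is the parser from earlier, and the Redis URLs and pydantic serialization are assumptions):

import asyncio

from celery import Celery, group

app = Celery(
    "cv_pipeline",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/1",
)

@app.task
def parse_cv_task(file_bytes: bytes, mime_type: str) -> dict:
    # One task per CV; Celery fans the batch out across workers.
    # (In practice you'd pass an object-store key rather than raw
    # bytes, since Celery's default JSON serializer rejects bytes.)
    parser = CVParser()
    parsed = asyncio.run(parser.parse(file_bytes, mime_type))
    return parsed.model_dump()  # assumes ParsedCV is a pydantic model

# Fan out a batch of (file_bytes, mime_type) pairs, wait for all results.
batch = group(parse_cv_task.s(f, mt) for f, mt in cv_batch)
results = batch.apply_async().get(timeout=120)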
Human-in-the-loop is not a failure — the system surfaces the top 20 candidates. A recruiter reviews them in 20 minutes instead of 3 hours. The AI doesn't replace the recruiter; it removes the tedious first filter.
Monitor for drift — embeddings trained on English CVs perform worse on French or Arabic CVs. We built per-language quality metrics and triggered retraining when accuracy dropped below a threshold.
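The check itself doesn't need to be clever. A sketch of its shape, with an illustrative threshold and eval-sample format:

from collections import defaultdict

ACCURACY_THRESHOLD = 0.85  # illustrative; tuned per language in practice

def languages_below_threshold(eval_samples: list[dict]) -> list[str]:
    # Each sample: {"lang": "fr", "predicted": ..., "expected": ...}
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for sample in eval_samples:
        total[sample["lang"]] += 1
        if sample["predicted"] == sample["expected"]:
            correct[sample["lang"]] += 1
    # Return the languages whose accuracy has drifted below threshold.
    return [
        lang for lang in total
        if correct[lang] / total[lang] < ACCURACY_THRESHOLD
    ]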
The gap between "AI demo" and "production AI system" is mostly engineering: reliability, latency, monitoring, and graceful degradation when the model is uncertain. Build for those from the start.
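To make that last point concrete, here's one way graceful degradation can look (a sketch; the 0.02 margin is illustrative): if the top scores are within noise of each other, say so instead of implying a real ranking.

def present_ranking(scored: list[ScoredCandidate]) -> dict:
    # If the leading scores are nearly indistinguishable, the ordering
    # is mostly noise. Surface that to the recruiter instead of
    # implying false precision.
    if len(scored) >= 2 and scored[0].score - scored[1].score < 0.02:
        return {
            "candidates": scored,
            "note": "Top candidates scored within noise; review all of them.",
        }
    return {"candidates": scored}

It's a small amount of code, but it's exactly the kind of reliability work the demo-to-production gap is made of.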