Methodology: How Mainaka Analyzes F1 Visa Interview Data
Data sources, normalization pipeline, question categorization schema, and statistical methodology — documented in detail so that the analysis can be evaluated, audited, and reproduced.
SECTION 01: The Dataset at a Glance
Mainaka's analytical foundation is a single normalized corpus we refer to as the canonical dataset. Every article, tool, and analytics surface on Mainaka — the consulate-level question rates, refusal pattern analysis, funding-question frequency tables, the Risk Score model coefficients — derives from this one source.
The dataset's current snapshot:
Per-consulate distribution of unique interviews:
| Consulate | Total Interviews | Approved | Refused | Approval Rate |
|---|---|---|---|---|
| Delhi | 1,799 | 1,593 | 152 | 91.3% |
| Mumbai | 1,791 | 1,534 | 206 | 88.2% |
| Chennai | 1,377 | 1,205 | 135 | 89.9% |
| Hyderabad | 1,335 | 1,200 | 102 | 92.2% |
| Kolkata | 565 | 507 | 50 | 91.0% |
| Total | 6,867 | 6,039 | 645 | 90.4% |
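The Approval Rate column is consistent with being computed over decided outcomes only (approved plus refused), which is why Approved and Refused do not sum to Total Interviews: the remainder are records whose outcome could not be determined. A minimal sketch of that convention (the helper name is ours, not Mainaka's):

```python
def approval_rate(approved: int, refused: int) -> float:
    """Approval rate over decided outcomes only.

    Records whose outcome could not be determined ("undecided") are
    excluded, which is why Approved + Refused < Total Interviews above.
    """
    return round(100 * approved / (approved + refused), 1)
```

For example, Delhi's 91.3% is 1,593 / (1,593 + 152), and the corpus-wide 90.4% is 6,039 / (6,039 + 645).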
The dataset's 90.4% approval rate is structurally higher than real-world FY2025 approval rates. This is because applicants who are approved are more likely to share their accounts publicly than those who are refused. The dataset is most useful for studying question patterns and answer structure, not for predicting individual approval probability. The Risk Score tool's calibration accounts for this skew explicitly.
SECTION 02: Data Sources and Provenance
Every interview account in the dataset was originally shared publicly by the applicant on community platforms — primarily Telegram channels where Indian F1 visa applicants document their interview experiences for the benefit of upcoming applicants. These are not scraped from private sources, leaked, or purchased.
The corpus aggregates four source archives:
| Source | Account Count | Time Range | Format |
|---|---|---|---|
| Telegram public channels | 5,784 | 2022-2025 | Channel exports (CSV) |
| 2024 community archive | 541 | 2024-2025 | Community-curated DOCX |
| 2022 community archive | 441 | 2022-2023 | Community-curated DOCX |
| 2023 community archive | 101 | 2023 | Community-curated DOCX |
The Telegram source is the largest and most actively growing component, aggregating accounts that Indian F1 applicants have publicly shared across multiple community channels over the past several years. The result is a longitudinal record spanning multiple visa-policy environments — something no single consultancy or coaching center has structurally equivalent access to.
All source material was already in the public domain at the time of collection. Mainaka does not collect interview accounts from private sources, gated communities, or any non-public channel.
SECTION 03: Anonymization and Privacy
Even though source accounts were posted publicly by their authors, Mainaka applies an additional anonymization layer before analysis or publication. This is a deliberate choice — the original authors generally did not anticipate their accounts being aggregated into a structured research corpus, and respect for the underlying applicants demands additional care.
The anonymization pipeline strips the following from any account before it enters the canonical dataset:
- Names — applicant names, family member names, sponsor names, friend names
- Specific dollar/rupee amounts — replaced with bracketed placeholders (e.g., [X] lakhs) when quoted in articles
- University-specific identifying combinations — when combined with rare profile data, the university name is generalized in published quotes
- Exact dates — preserved as month/year only
- Counter numbers, officer descriptions — when sufficiently rare to be identifying
- Telegram usernames, contact information — fully removed
- Specific company names — preserved when public-domain (Amazon, Microsoft, etc.); generalized when small/regional
When refusal examples are quoted in articles, only the question text and the structural pattern of the answer are preserved — never anything that could reasonably identify the original applicant. Where doubt exists, more anonymization is applied, not less.
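A simplified sketch of the kind of substitutions the anonymization layer performs. The three regexes below are illustrative, not the production rule set, which is broader and curated:

```python
import re

def anonymize(text: str) -> str:
    """Illustrative anonymization pass; the real pipeline applies
    a larger rule set than these three substitutions."""
    # Rupee amounts ("45 lakhs") -> bracketed placeholder, as in published quotes.
    text = re.sub(r"\b\d[\d,]*(\.\d+)?\s*lakhs?\b", "[X] lakhs", text,
                  flags=re.IGNORECASE)
    # Dollar amounts ("$60,000") -> placeholder.
    text = re.sub(r"\$\s?\d[\d,]*", "[$X]", text)
    # Telegram-style usernames -> removed entirely.
    text = re.sub(r"@\w+", "[removed]", text)
    return text
```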
SECTION 04: Normalization Pipeline
Raw interview accounts arrive in highly inconsistent formats — some applicants write detailed Q-by-Q dialogues, others write narrative summaries, others share only key questions and outcomes. The normalization pipeline transforms this raw heterogeneity into a single structured schema.
The canonical record schema looks like this:
"source": "telegram.csv",
"text": "<raw interview account, anonymized>",
"consulate": "Mumbai" | "Delhi" | "Hyderabad" | "Chennai" | "Kolkata",
"status": "approved" | "refused" | "undecided",
"qa_pairs": [
{ "q": "Why this university?", "a": "<applicant's answer>" },
{ "q": "What does your father do?", "a": "<applicant's answer>" },
...
]
}
Every record passes through five sequential pipeline stages:
1. Source format parsing
DOCX archives are parsed with python-docx; CSV exports with pandas. Each individual interview account is identified by source-specific delimiters (date markers, hashtags, "Status:" lines) and isolated as a candidate record.
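As a sketch of the record-isolation step, here is a minimal splitter for one hypothetical delimiter convention (a "Status:" line opening each account); the production parser handles several such conventions per source:

```python
import re

def split_accounts(raw_export: str) -> list[str]:
    """Isolate candidate interview accounts from a raw channel export.

    Splits on lines beginning with "Status:" -- one illustrative delimiter;
    real sources also use date markers and hashtags, as described above.
    """
    parts = re.split(r"(?=^Status:)", raw_export, flags=re.MULTILINE)
    return [p.strip() for p in parts if p.strip()]
```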
2. Removing repost duplicates
Telegram in particular contains heavy reposting. We compute fuzzy hashes on the first 500 characters of each account and discard near-duplicates with similarity above a threshold. This typically removes 8-12% of raw entries.
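The deduplication behavior can be sketched with the standard library. Note the production pipeline uses fuzzy hashing, which scales better than the pairwise comparison shown here, so treat this as a behavioral sketch only:

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.9  # illustrative; the production threshold may differ

def is_near_duplicate(a: str, b: str) -> bool:
    # Compare only the first 500 characters, as in the pipeline.
    ratio = SequenceMatcher(None, a[:500].lower(), b[:500].lower()).ratio()
    return ratio >= SIMILARITY_THRESHOLD

def dedupe(accounts: list[str]) -> list[str]:
    """Keep the first copy of each account, dropping near-duplicate reposts."""
    kept: list[str] = []
    for account in accounts:
        if not any(is_near_duplicate(account, k) for k in kept):
            kept.append(account)
    return kept
```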
3. Consulate, status, intake, and university tagging
Pattern-matching extracts the consulate (from explicit mentions of "Mumbai consulate" / "Delhi consulate" etc.), status (approved/refused based on stated outcome plus markers like "VO returned passport"), and where present: intake season, university, profile attributes (CGPA, backlogs, work experience).
4. Parsing dialogue into structured Q-A pairs
Most accounts use some form of "VO:" / "Me:" turn-taking convention. The parser identifies these markers (case-insensitive, with common variants like "Visa Officer", "Officer", "Applicant", "I said") and extracts each turn pair as a structured Q-A entry. Accounts without parseable dialogue are flagged but kept as raw text for narrative analysis.
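A minimal version of the turn-pair extractor, covering a subset of the marker variants named above (the production pattern set is larger):

```python
import re

VO = r"(?:Visa Officer|Officer|VO)"
ME = r"(?:Me|I said|Applicant)"

TURN_PAIR = re.compile(
    rf"^{VO}\s*:\s*(?P<q>.+?)\n{ME}\s*:\s*(?P<a>.+?)(?=\n{VO}\s*:|\Z)",
    re.IGNORECASE | re.MULTILINE | re.DOTALL,
)

def parse_qa_pairs(text: str) -> list[dict]:
    """Extract structured Q-A pairs from a VO:/Me: style dialogue."""
    return [
        {"q": m.group("q").strip(), "a": m.group("a").strip()}
        for m in TURN_PAIR.finditer(text)
    ]
```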
5. Quality filtering and rejection
Records are rejected if: (a) consulate cannot be determined, (b) status cannot be determined and no outcome is implicit, (c) the account is shorter than 50 words (likely just a header without content), or (d) the content fails a basic English-language check. Roughly 4% of pre-validation entries are filtered at this stage.
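The rejection rules can be expressed as a single predicate; the English-language check below is a crude ASCII-ratio stand-in for whatever detector the pipeline actually uses:

```python
def passes_quality_filter(record: dict) -> bool:
    """Apply rejection rules (a)-(d) to a candidate record."""
    text = record.get("text", "")
    if not record.get("consulate"):
        return False  # (a) consulate cannot be determined
    if record.get("status") not in ("approved", "refused", "undecided"):
        return False  # (b) status cannot be determined
    if len(text.split()) < 50:
        return False  # (c) likely a header without real content
    # (d) crude English-language proxy: proportion of ASCII characters.
    ascii_ratio = sum(ch.isascii() for ch in text) / max(len(text), 1)
    return ascii_ratio > 0.9
```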
The post-pipeline corpus is stored as a single JSON Lines file (final_dataset_with_qa.jsonl) with one record per line. This format supports streaming analysis without loading the full 1.77 million words into memory.
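Streaming the corpus one record at a time is straightforward with the standard library; this mirrors the one-record-per-line JSON Lines layout described above:

```python
import json
from typing import Iterator

def stream_records(path: str) -> Iterator[dict]:
    """Yield one canonical record per line, without loading the
    whole JSONL corpus into memory."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # tolerate blank lines
                yield json.loads(line)
```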
SECTION 05: Question Categorization Schema
Raw question strings are too varied for direct frequency analysis. "What does your father do?", "Father's occupation?", "Tell me about your father", and "What is your father's profession?" are semantically the same question but textually distinct. The categorization schema groups these into a unified taxonomy.
The current taxonomy organizes questions into seven primary categories:
1. University & Course Choice
"Why this university?", "Why this course?", "Which universities did you apply to?", "Why this specialization?" — questions probing the applicant's deliberate selection rationale.
2. Funding & Sponsorship
"What does your father do?", "Who is sponsoring?", "How much is your loan?", "Bank balance?", "Annual income?", "Sponsor relationship?" — questions verifying the funding chain.
3. Academic Profile
"What is your CGPA?", "Tell me about your undergraduate", "Backlogs?", "Subjects studied?", "Test scores?" — questions probing academic credentials.
4. Career Plans & Return Intent
"Plans after masters?", "Will you return to India?", "Where will you work?", "Why not pursue this in India?" — questions evaluating return intent and career coherence.
5. Family & Personal Status
"Are you married?", "Any siblings in the U.S.?", "Family members in U.S.?", "Where do your parents live?" — questions evaluating personal ties and immigration intent signals.
6. University-Specific Probing
"Tell me about a professor's research?", "What is the curriculum?", "Co-op program?", "Course names?" — follow-up questions that test depth of preparation.
7. Procedural & Document
"Pass me your I-20", "Fingerprints please", "DS-160 confirmation?" — procedural turns that don't carry analytical weight but appear in transcripts.
Each question category has between 12 and 38 regular-expression patterns that capture the question's natural-language variants. The pattern library is versioned and updated quarterly as new question phrasings appear in incoming data.
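In sketch form, categorization reduces to a first-match lookup over the pattern library. The two categories and handful of regexes below are illustrative excerpts, not the versioned production library:

```python
import re
from typing import Optional

# Illustrative excerpt; real categories carry 12-38 patterns each.
CATEGORY_PATTERNS = {
    "Funding & Sponsorship": [
        r"father'?s?\s+(do|occupation|profession)",
        r"who\s+is\s+sponsoring",
        r"bank\s+balance",
    ],
    "University & Course Choice": [
        r"why\s+this\s+(university|course|specialization)",
        r"which\s+universities\s+did\s+you\s+apply",
    ],
}

def categorize(question: str) -> Optional[str]:
    """Map a raw question string to its primary category, if any."""
    q = question.lower()
    for category, patterns in CATEGORY_PATTERNS.items():
        if any(re.search(p, q) for p in patterns):
            return category
    return None
```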
SECTION 06: Statistical Methodology
The statistics published on Mainaka — refusal-rate gaps, question frequencies, consulate-level patterns — follow a small set of explicit conventions:
Question frequency: percentage of interviews containing the question
For a given question category and consulate cohort, frequency is computed as:

frequency (%) = (interviews containing at least one question in the category ÷ total interviews in the cohort) × 100
An interview that asks "Why this university?" three times still counts as 1 interview-with-question, not 3. This avoids over-weighting interviews where the officer probed the same topic multiple times.
Approved-vs-refused gap
The gap is the signed percentage-point difference between refused-cohort frequency and approved-cohort frequency:

gap (pp) = frequency(refused) − frequency(approved)
Positive gap = the question clusters more in refusals. Negative gap = the question clusters more in approvals. We report gaps in percentage points (pp) rather than relative ratios because percentage points are the more intuitive unit when frequencies are close to 50%.
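Both conventions together, as a minimal sketch (assuming each record carries the set of categories matched in its questions):

```python
def question_frequency(records: list[dict], category: str) -> float:
    """Percent of interviews containing >= 1 question in the category.
    A question asked three times still counts its interview once."""
    hits = sum(1 for r in records if category in r["categories"])
    return 100.0 * hits / len(records)

def approved_refused_gap(approved: list[dict], refused: list[dict],
                         category: str) -> float:
    """Signed gap in percentage points; positive = clusters in refusals."""
    return (question_frequency(refused, category)
            - question_frequency(approved, category))
```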
Per-consulate vs aggregate
Aggregate statistics are computed across the full corpus without re-weighting for consulate sample-size differences. This means a question's aggregate frequency is implicitly weighted toward Mumbai and Delhi (the largest consulate cohorts) and away from Kolkata. When consulate-specific patterns matter, we always report per-consulate breakdowns alongside the aggregate.
Confidence and minimum cohort size
We do not publish patterns for sub-cohorts smaller than 50 interviews. Kolkata has the smallest cohort at 565 interviews; per-consulate refused cohorts range from 50 (Kolkata) to 206 (Mumbai). All published gap statistics meet this minimum threshold.
SECTION 07: Limitations and Honest Caveats
This section exists because credible research acknowledges what it cannot do. The following limitations apply to every published claim derived from the canonical dataset:
Selection bias toward approvals
Applicants who experience approvals are more likely to share their accounts publicly. This is why our dataset shows a 90.4% approval rate while the State Department's FY2025 data shows ~39% approval. The dataset is a research tool for studying question patterns and answer structure, not a population sample for estimating approval probability.
Recall bias and self-reporting
Accounts are written by applicants after the fact, sometimes hours or days later. Verbatim transcript fidelity varies. The data is suitable for identifying high-level question patterns but not for fine-grained verbal analysis (e.g., exact word choice in an officer's question).
Outcome misattribution risk
An applicant attributing their approval or refusal to a specific question or moment may be mistaken. Officers often make decisions before the moment the applicant perceives as decisive. Our analysis examines question distributions across outcomes, not applicants' interpretations of which question was decisive.
Time-period effects
The 2022-2025 window spans materially different visa-policy environments. FY2024 (41% refusal) and FY2025 (~61% refusal) are structurally tougher than the 2022-2023 environment that generated much of the approval-side data. We report time-period effects when relevant, but historical patterns may not directly predict 2026 outcomes.
No causal inference
Frequency analysis identifies correlations between question presence and outcome, not causation. When we say "Why this university?" appears in 29.8% of refused interviews vs 19.3% of approved, we are reporting that correlation — not claiming the question itself causes refusals. The structural answer-quality difference is the more likely causal mechanism.
India-only F1 scope
The current dataset is exclusively Indian F1 applicants at Indian U.S. consulates. Findings should not be extrapolated to other visa types, other countries, or U.S. consulates outside India without independent validation.
SECTION 08: Update Cadence and Versioning
The canonical dataset is updated quarterly. Each refresh adds new interview accounts from the same source channels and applies the full pipeline (anonymization → deduplication → categorization → validation). When pattern libraries change, prior analyses are recomputed against the updated taxonomy.
Published article statistics include a last updated date. When dataset refreshes meaningfully change a published statistic (typically defined as >1 percentage point shift), the article is updated and the change noted.
Future planned updates to the methodology itself:
- Outcome verification layer — supplementing self-reported outcomes with structured outcome submissions from Mainaka users post-interview, allowing cross-validation between self-reported and outcome-tracked data
- Longitudinal preparation tracking — connecting Mainaka mock-interview performance with subsequent real-interview outcomes, generating the first preparation-to-outcome calibration dataset for Indian F1 applicants
- Multi-country expansion — extending the same methodology to UK Tier 4 / Student Visa, Canada Study Permit, Schengen student visas, applying the same anonymization, normalization, and categorization patterns
- Embedding-based question categorization — moving from regex patterns to sentence embeddings for finer question-type clustering
SECTION 09: Research Principles
The methodology above is shaped by five research principles that guide every published analysis:
Evidence over advice
When we say "Mumbai officers ask 'Father do?' in 30% of interviews," we mean we observed it in the dataset and can show the count. We do not say "Mumbai officers always ask about funding" because the data does not support always.
Disclosure over polish
This page exists. Limitations are documented. Sample sizes are disclosed. When statistics are uncertain, we say so — even when polished marketing would prefer otherwise.
Auditability over reproducibility
The methodology is documented at the level of detail necessary for evaluation and audit. The proprietary value of Mainaka rests not in the pipeline alone but in the continuously evolving curated dataset, the integrated AI evaluation systems, and the productized preparation workflows built on top — none of which are externally replicable.
Anonymization first
When ambiguity exists between preserving an interesting detail and protecting an applicant's identity, anonymization wins. Every. Time.
Honest framing of what the data CAN'T tell us
The data tells us about question patterns. It does not tell us individual approval probability, officer mental states, or causal relationships. We are explicit about which questions our methodology can answer and which it cannot.
Practice with the same data that informed this methodology
Mainaka's free AI mock interview is calibrated on this exact corpus. The mock asks questions the way each Indian consulate actually asks them — and reacts to weak answers the way real officers do. All 5 mocks are free during Mainaka's outcome-data phase.
Start Free Mock → All tools currently free — Mainaka is in its outcome-data phase, building real-world evidence before launching paid plans later in 2026.