Methodology: How Mainaka Analyzes F1 Visa Interview Data
Data sources, normalization pipeline, question categorization schema, and statistical methodology — documented in detail so that the analysis can be evaluated, audited, and reproduced.
SECTION 01: The Dataset at a Glance
Mainaka's analytical foundation is a single normalized corpus we refer to as the canonical dataset. Every article, tool, and analytics surface on Mainaka — the consulate-level question rates, refusal pattern analysis, funding-question frequency tables, the Risk Score model coefficients — derives from this one source.
The dataset's current snapshot:
Per-consulate distribution of unique interviews:
| Consulate | Total Interviews | Approved | Refused | Approval Rate |
|---|---|---|---|---|
| Delhi | 1,799 | 1,593 | 152 | 91.3% |
| Mumbai | 1,791 | 1,534 | 206 | 88.2% |
| Chennai | 1,377 | 1,205 | 135 | 89.9% |
| Hyderabad | 1,335 | 1,200 | 102 | 92.2% |
| Kolkata | 565 | 507 | 50 | 91.0% |
| Total | 6,867 | 6,039 | 645 | 90.4% |
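The Approval Rate column is consistent with being computed over decided outcomes only (approved plus refused), which is why Approved and Refused do not sum to Total Interviews: the remainder are records whose outcome could not be determined. A minimal sketch of that convention (the helper name is ours, not Mainaka's):

```python
def approval_rate(approved: int, refused: int) -> float:
    """Approval rate over decided outcomes only.

    Records whose outcome could not be determined ("undecided") are
    excluded, which is why Approved + Refused < Total Interviews above.
    """
    return round(100 * approved / (approved + refused), 1)
```

For example, Delhi's 91.3% is 1,593 / (1,593 + 152), and the corpus-wide 90.4% is 6,039 / (6,039 + 645).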
The dataset's 90.4% approval rate is structurally higher than real-world FY2025 approval rates. This is because applicants who are approved are more likely to share their accounts publicly than those who are refused. The dataset is most useful for studying question patterns and answer structure, not for predicting individual approval probability. The Risk Score tool's calibration accounts for this skew explicitly.
SECTION 02: Data Sources and Provenance
Every interview account in the dataset was originally shared publicly by the applicant on community platforms — primarily Telegram channels where Indian F1 visa applicants document their interview experiences for the benefit of upcoming applicants. These are not scraped from private sources, leaked, or purchased.
The corpus aggregates four source archives:
| Source | Account Count | Time Range | Format |
|---|---|---|---|
| Telegram public channels | 5,784 | 2022-2025 | Channel exports (CSV) |
| 2024 community archive | 541 | 2024-2025 | Community-curated DOCX |
| 2022 community archive | 441 | 2022-2023 | Community-curated DOCX |
| 2023 community archive | 101 | 2023 | Community-curated DOCX |
The Telegram source is the largest and most actively growing component, aggregating accounts that Indian F1 applicants have publicly shared across multiple community channels over the past several years. The result is a longitudinal record spanning multiple visa-policy environments — something no single consultancy or coaching center has structurally equivalent access to.
All source material was already in the public domain at the time of collection. Mainaka does not collect interview accounts from private sources, gated communities, or any non-public channel.
SECTION 03: Anonymization and Privacy
Even though source accounts were posted publicly by their authors, Mainaka applies an additional anonymization layer before analysis or publication. This is a deliberate choice — the original authors generally did not anticipate their accounts being aggregated into a structured research corpus, and respect for the underlying applicants demands additional care.
The anonymization pipeline strips the following from any account before it enters the canonical dataset:
- Names — applicant names, family member names, sponsor names, friend names
- Specific dollar/rupee amounts — replaced with bracketed placeholders (e.g., [X] lakhs) when quoted in articles
- University-specific identifying combinations — when combined with rare profile data, the university name is generalized in published quotes
- Exact dates — preserved as month/year only
- Counter numbers, officer descriptions — when sufficiently rare to be identifying
- Telegram usernames, contact information — fully removed
- Specific company names — preserved when public-domain (Amazon, Microsoft, etc.); generalized when small/regional
When refusal examples are quoted in articles, only the question text and the structural pattern of the answer are preserved — never anything that could reasonably identify the original applicant. Where doubt exists, more anonymization is applied, not less.
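A simplified sketch of the kind of substitutions the anonymization layer performs. The three regexes below are illustrative, not the production rule set, which is broader and curated:

```python
import re

def anonymize(text: str) -> str:
    """Illustrative anonymization pass; the real pipeline applies
    a larger rule set than these three substitutions."""
    # Rupee amounts ("45 lakhs") -> bracketed placeholder, as in published quotes.
    text = re.sub(r"\b\d[\d,]*(\.\d+)?\s*lakhs?\b", "[X] lakhs", text,
                  flags=re.IGNORECASE)
    # Dollar amounts ("$60,000") -> placeholder.
    text = re.sub(r"\$\s?\d[\d,]*", "[$X]", text)
    # Telegram-style usernames -> removed entirely.
    text = re.sub(r"@\w+", "[removed]", text)
    return text
```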
SECTION 04: Normalization Pipeline
Raw interview accounts arrive in highly inconsistent formats — some applicants write detailed Q-by-Q dialogues, others write narrative summaries, others share only key questions and outcomes. The normalization pipeline transforms this raw heterogeneity into a single structured schema.
The canonical record schema looks like this:
"source": "telegram.csv",
"text": "<raw interview account, anonymized>",
"consulate": "Mumbai" | "Delhi" | "Hyderabad" | "Chennai" | "Kolkata",
"status": "approved" | "refused" | "undecided",
"qa_pairs": [
{ "q": "Why this university?", "a": "<applicant's answer>" },
{ "q": "What does your father do?", "a": "<applicant's answer>" },
...
]
}
Every record passes through five sequential pipeline stages:
1. Source format parsing
DOCX archives are parsed with python-docx; CSV exports with pandas. Each individual interview account is identified by source-specific delimiters (date markers, hashtags, "Status:" lines) and isolated as a candidate record.
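As a sketch of the record-isolation step, here is a minimal splitter for one hypothetical delimiter convention (a "Status:" line opening each account); the production parser handles several such conventions per source:

```python
import re

def split_accounts(raw_export: str) -> list[str]:
    """Isolate candidate interview accounts from a raw channel export.

    Splits on lines beginning with "Status:" -- one illustrative delimiter;
    real sources also use date markers and hashtags, as described above.
    """
    parts = re.split(r"(?=^Status:)", raw_export, flags=re.MULTILINE)
    return [p.strip() for p in parts if p.strip()]
```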
2. Removing repost duplicates
Telegram in particular contains heavy reposting. We compute fuzzy hashes on the first 500 characters of each account and discard near-duplicates with similarity above a threshold. This typically removes 8-12% of raw entries.
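The deduplication behavior can be sketched with the standard library. Note the production pipeline uses fuzzy hashing, which scales better than the pairwise comparison shown here, so treat this as a behavioral sketch only:

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.9  # illustrative; the production threshold may differ

def is_near_duplicate(a: str, b: str) -> bool:
    # Compare only the first 500 characters, as in the pipeline.
    ratio = SequenceMatcher(None, a[:500].lower(), b[:500].lower()).ratio()
    return ratio >= SIMILARITY_THRESHOLD

def dedupe(accounts: list[str]) -> list[str]:
    """Keep the first copy of each account, dropping near-duplicate reposts."""
    kept: list[str] = []
    for account in accounts:
        if not any(is_near_duplicate(account, k) for k in kept):
            kept.append(account)
    return kept
```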
3. Consulate, status, intake, and university tagging
Pattern-matching extracts the consulate (from explicit mentions of "Mumbai consulate" / "Delhi consulate" etc.), status (approved/refused based on stated outcome plus markers like "VO returned passport"), and where present: intake season, university, profile attributes (CGPA, backlogs, work experience).
4. Parsing dialogue into structured Q-A pairs
Most accounts use some form of "VO:" / "Me:" turn-taking convention. The parser identifies these markers (case-insensitive, with common variants like "Visa Officer", "Officer", "Applicant", "I said") and extracts each turn pair as a structured Q-A entry. Accounts without parseable dialogue are flagged but kept as raw text for narrative analysis.
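A minimal version of the turn-pair extractor, covering a subset of the marker variants named above (the production pattern set is larger):

```python
import re

VO = r"(?:Visa Officer|Officer|VO)"
ME = r"(?:Me|I said|Applicant)"

TURN_PAIR = re.compile(
    rf"^{VO}\s*:\s*(?P<q>.+?)\n{ME}\s*:\s*(?P<a>.+?)(?=\n{VO}\s*:|\Z)",
    re.IGNORECASE | re.MULTILINE | re.DOTALL,
)

def parse_qa_pairs(text: str) -> list[dict]:
    """Extract structured Q-A pairs from a VO:/Me: style dialogue."""
    return [
        {"q": m.group("q").strip(), "a": m.group("a").strip()}
        for m in TURN_PAIR.finditer(text)
    ]
```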
5. Quality filtering and rejection
Records are rejected if: (a) consulate cannot be determined, (b) status cannot be determined and no outcome is implicit, (c) the account is shorter than 50 words (likely just a header without content), or (d) the content fails a basic English-language check. Roughly 4% of pre-validation entries are filtered at this stage.
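The rejection rules can be expressed as a single predicate; the English-language check below is a crude ASCII-ratio stand-in for whatever detector the pipeline actually uses:

```python
def passes_quality_filter(record: dict) -> bool:
    """Apply rejection rules (a)-(d) to a candidate record."""
    text = record.get("text", "")
    if not record.get("consulate"):
        return False  # (a) consulate cannot be determined
    if record.get("status") not in ("approved", "refused", "undecided"):
        return False  # (b) status cannot be determined
    if len(text.split()) < 50:
        return False  # (c) likely a header without real content
    # (d) crude English-language proxy: proportion of ASCII characters.
    ascii_ratio = sum(ch.isascii() for ch in text) / max(len(text), 1)
    return ascii_ratio > 0.9
```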
The post-pipeline corpus is stored as a single JSON Lines file (final_dataset_with_qa.jsonl) with one record per line. This format supports streaming analysis without loading the full 1.77 million words into memory.
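Streaming the corpus one record at a time is straightforward with the standard library; this mirrors the one-record-per-line JSON Lines layout described above:

```python
import json
from typing import Iterator

def stream_records(path: str) -> Iterator[dict]:
    """Yield one canonical record per line, without loading the
    whole JSONL corpus into memory."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # tolerate blank lines
                yield json.loads(line)
```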
SECTION 05: Question Categorization Schema
Raw question strings are too varied for direct frequency analysis. "What does your father do?", "Father's occupation?", "Tell me about your father", and "What is your father's profession?" are semantically the same question but textually distinct. The categorization schema groups these into a unified taxonomy.
The current taxonomy organizes questions into seven primary categories:
1. University & Course Choice
"Why this university?", "Why this course?", "Which universities did you apply to?", "Why this specialization?" — questions probing the applicant's deliberate selection rationale.
2. Funding & Sponsorship
"What does your father do?", "Who is sponsoring?", "How much is your loan?", "Bank balance?", "Annual income?", "Sponsor relationship?" — questions verifying the funding chain.
3. Academic Profile
"What is your CGPA?", "Tell me about your undergraduate", "Backlogs?", "Subjects studied?", "Test scores?" — questions probing academic credentials.
4. Career Plans & Return Intent
"Plans after masters?", "Will you return to India?", "Where will you work?", "Why not pursue this in India?" — questions evaluating return intent and career coherence.
5. Family & Personal Status
"Are you married?", "Any siblings in the U.S.?", "Family members in U.S.?", "Where do your parents live?" — questions evaluating personal ties and immigration intent signals.
6. University-Specific Probing
"Tell me about a professor's research?", "What is the curriculum?", "Co-op program?", "Course names?" — follow-up questions that test depth of preparation.
7. Procedural & Document
"Pass me your I-20", "Fingerprints please", "DS-160 confirmation?" — procedural turns that don't carry analytical weight but appear in transcripts.
Each question category has between 12 and 38 regular-expression patterns that capture the question's natural-language variants. The pattern library is versioned and updated quarterly as new question phrasings appear in incoming data.
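In sketch form, categorization reduces to a first-match lookup over the pattern library. The two categories and handful of regexes below are illustrative excerpts, not the versioned production library:

```python
import re
from typing import Optional

# Illustrative excerpt; real categories carry 12-38 patterns each.
CATEGORY_PATTERNS = {
    "Funding & Sponsorship": [
        r"father'?s?\s+(do|occupation|profession)",
        r"who\s+is\s+sponsoring",
        r"bank\s+balance",
    ],
    "University & Course Choice": [
        r"why\s+this\s+(university|course|specialization)",
        r"which\s+universities\s+did\s+you\s+apply",
    ],
}

def categorize(question: str) -> Optional[str]:
    """Map a raw question string to its primary category, if any."""
    q = question.lower()
    for category, patterns in CATEGORY_PATTERNS.items():
        if any(re.search(p, q) for p in patterns):
            return category
    return None
```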
SECTION 06: Statistical Methodology
The statistics published on Mainaka — refusal-rate gaps, question frequencies, consulate-level patterns — follow a small set of explicit conventions:
Question frequency: percentage of interviews containing the question
For a given question category and consulate cohort, frequency is computed as:

frequency (%) = (interviews containing at least one question in the category ÷ total interviews in the cohort) × 100
An interview that asks "Why this university?" three times still counts as 1 interview-with-question, not 3. This avoids over-weighting interviews where the officer probed the same topic multiple times.
Approved-vs-refused gap
The gap is the signed percentage-point difference between refused-cohort frequency and approved-cohort frequency:

gap (pp) = frequency(refused) − frequency(approved)
Positive gap = the question clusters more in refusals. Negative gap = the question clusters more in approvals. We report gaps in percentage points (pp) rather than relative ratios because percentage points are the more intuitive unit when frequencies are close to 50%.
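Both conventions together, as a minimal sketch (assuming each record carries the set of categories matched in its questions):

```python
def question_frequency(records: list[dict], category: str) -> float:
    """Percent of interviews containing >= 1 question in the category.
    A question asked three times still counts its interview once."""
    hits = sum(1 for r in records if category in r["categories"])
    return 100.0 * hits / len(records)

def approved_refused_gap(approved: list[dict], refused: list[dict],
                         category: str) -> float:
    """Signed gap in percentage points; positive = clusters in refusals."""
    return (question_frequency(refused, category)
            - question_frequency(approved, category))
```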
Per-consulate vs aggregate
Aggregate statistics are computed across the full corpus without re-weighting for consulate sample-size differences. This means a question's aggregate frequency is implicitly weighted toward Mumbai and Delhi (the largest consulate cohorts) and away from Kolkata. When consulate-specific patterns matter, we always report per-consulate breakdowns alongside the aggregate.
Confidence and minimum cohort size
We do not publish patterns for sub-cohorts smaller than 50 interviews. Kolkata has the smallest cohort at 565 interviews; per-consulate refused cohorts range from 50 (Kolkata) to 206 (Mumbai). All published gap statistics meet this minimum threshold.
SECTION 07: Limitations and Honest Caveats
This section exists because credible research acknowledges what it cannot do. The following limitations apply to every published claim derived from the canonical dataset:
Selection bias toward approvals
Applicants who experience approvals are more likely to share their accounts publicly. This is why our dataset shows a 90.4% approval rate while the State Department's FY2025 data shows ~39% approval. The dataset is a research tool for studying question patterns and answer structure, not a population sample for estimating approval probability.
Recall bias and self-reporting
Accounts are written by applicants after the fact, sometimes hours or days later. Verbatim transcript fidelity varies. The data is suitable for identifying high-level question patterns but not for fine-grained verbal analysis (e.g., exact word choice in an officer's question).
Outcome misattribution risk
An applicant attributing their approval or refusal to a specific question or moment may be mistaken. Officers often make decisions before the moment the applicant perceives as decisive. Our analysis examines question distributions across outcomes, not applicants' interpretations of which question was decisive.
Time-period effects
The 2022-2025 window spans materially different visa-policy environments. FY2024 (41% refusal) and FY2025 (~61% refusal) are structurally tougher than the 2022-2023 environment that generated much of the approval-side data. We report time-period effects when relevant, but historical patterns may not directly predict 2026 outcomes.
No causal inference
Frequency analysis identifies correlations between question presence and outcome, not causation. When we say "Why this university?" appears in 29.8% of refused interviews vs 19.3% of approved, we are reporting that correlation — not claiming the question itself causes refusals. The structural answer-quality difference is the more likely causal mechanism.
India-only F1 scope
The current dataset is exclusively Indian F1 applicants at Indian U.S. consulates. Findings should not be extrapolated to other visa types, other countries, or U.S. consulates outside India without independent validation.
SECTION 08: Update Cadence and Versioning
The canonical dataset is updated quarterly. Each refresh adds new interview accounts from the same source channels and applies the full pipeline (anonymization → deduplication → categorization → validation). When pattern libraries change, prior analyses are recomputed against the updated taxonomy.
Published article statistics include a last updated date. When dataset refreshes meaningfully change a published statistic (typically defined as >1 percentage point shift), the article is updated and the change noted.
Future planned updates to the methodology itself:
- Outcome verification layer — supplementing self-reported outcomes with structured outcome submissions from Mainaka users post-interview, allowing cross-validation between self-reported and outcome-tracked data
- Longitudinal preparation tracking — connecting Mainaka mock-interview performance with subsequent real-interview outcomes, generating the first preparation-to-outcome calibration dataset for Indian F1 applicants
- Multi-country expansion — extending the same methodology to UK Tier 4 / Student Visa, Canada Study Permit, Schengen student visas, applying the same anonymization, normalization, and categorization patterns
- Embedding-based question categorization — moving from regex patterns to sentence embeddings for finer question-type clustering
SECTION 09: Research Principles
The methodology above is shaped by five research principles that guide every published analysis:
Evidence over advice
When we say "Mumbai officers ask 'Father do?' in 30% of interviews," we mean we observed it in the dataset and can show the count. We do not say "Mumbai officers always ask about funding" because the data does not support always.
Disclosure over polish
This page exists. Limitations are documented. Sample sizes are disclosed. When statistics are uncertain, we say so — even when polished marketing would prefer otherwise.
Auditability over reproducibility
The methodology is documented at the level of detail necessary for evaluation and audit. The proprietary value of Mainaka rests not in the pipeline alone but in the continuously evolving curated dataset, the integrated AI evaluation systems, and the productized preparation workflows built on top — none of which are externally replicable.
Anonymization first
When ambiguity exists between preserving an interesting detail and protecting an applicant's identity, anonymization wins. Every. Time.
Honest framing of what the data CAN'T tell us
The data tells us about question patterns. It does not tell us individual approval probability, officer mental states, or causal relationships. We are explicit about which questions our methodology can answer and which it cannot.
Practice with the same data that informed this methodology
Mainaka's free AI mock interview is calibrated on this exact corpus. The mock asks questions the way each Indian consulate actually asks them — and reacts to weak answers the way real officers do. All 5 mocks are free during Mainaka's outcome-data phase.
Start Free Mock → All tools currently free — Mainaka is in its outcome-data phase, building real-world evidence before launching paid plans later in 2026.