OpenAI Introduces IndQA: A Culture-Aware Benchmark For Indian Languages


How can we reliably test whether large language models actually understand Indian languages and culture in real-world contexts? OpenAI has released IndQA, a benchmark that evaluates how well AI models understand and reason about questions that matter in Indian languages across cultural domains.

Why IndQA?

OpenAI states that about 80 percent of people worldwide do not speak English as their primary language. Yet most benchmarks that measure non-English capabilities are still narrow and often rely on translation or multiple-choice formats.

Benchmarks such as MMMLU and MGSM are now near saturation at the top end, where strong models cluster around similar scores. This makes it hard to see meaningful progress and does not test whether models understand local context, history and everyday life.

India is OpenAI’s starting point for new region-focused benchmarks. India has about 1 billion people who do not use English as their primary language and 22 official languages, at least 7 of which are spoken by more than 50 million people each, and it is ChatGPT’s second largest market.

Dataset, Languages And Domains

IndQA evaluates knowledge and reasoning about Indian culture and everyday life in Indian languages. The benchmark spans 2,278 questions across 12 languages and 10 cultural domains, created with 261 domain experts from across India.

The cultural domains are Architecture and Design, Arts and Culture, Everyday Life, Food and Cuisine, History, Law and Ethics, Literature and Linguistics, Media and Entertainment, Religion and Spirituality, and Sports and Recreation. Items are written natively in Bengali, English, Hindi, Hinglish, Kannada, Marathi, Odia, Telugu, Gujarati, Malayalam, Punjabi and Tamil. Hinglish is included to reflect common code switching in Indian conversations.

Each datapoint contains four components: a culturally grounded prompt in an Indian language, an English translation for auditability, rubric criteria for grading, and an ideal answer that encodes expert expectations.
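
To make that structure concrete, here is a rough sketch of how a single IndQA item could be represented. The field names, weights and example content are illustrative assumptions, not OpenAI’s published schema.

```python
# Hypothetical representation of one IndQA datapoint.
# Field names, weights and content are illustrative assumptions,
# not OpenAI's published schema.
datapoint = {
    "language": "Hindi",
    "domain": "Food and Cuisine",
    "prompt": "Culturally grounded question, written natively in Hindi.",
    "prompt_en": "English translation of the prompt, kept for auditability.",
    "rubric": [
        {"criterion": "Names the region the dish comes from", "weight": 2.0},
        {"criterion": "Explains its festival or seasonal context", "weight": 1.0},
        {"criterion": "Does not conflate it with a similar dish", "weight": 1.0},
    ],
    "ideal_answer": "Expert-written reference answer encoding the expected reasoning.",
}
```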

Rubric Based Evaluation Pipeline

IndQA uses a rubric-based grading procedure instead of exact-match accuracy. For each question, domain experts define multiple criteria that describe what a strong answer should include or avoid and assign a weight to each criterion.

A model-based grader checks the candidate response against these criteria and marks which ones are satisfied. The final score is the sum of weights for satisfied criteria divided by the total possible score. This behaves like grading a short exam answer: it supports partial credit and captures nuance and cultural correctness, not only surface token overlap.
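
The scoring itself reduces to a weighted fraction, as in the minimal sketch below. The `satisfies` callable is a hypothetical stand-in for OpenAI’s model-based grader, which is not public.

```python
from typing import Callable

def rubric_score(
    response: str,
    rubric: list[dict],
    satisfies: Callable[[str, str], bool],
) -> float:
    """Return the weighted fraction of rubric criteria the response meets.

    `satisfies(response, criterion)` stands in for the model-based grader
    that decides whether a single criterion is satisfied.
    """
    total = sum(c["weight"] for c in rubric)
    if total == 0:
        return 0.0
    earned = sum(
        c["weight"] for c in rubric if satisfies(response, c["criterion"])
    )
    return earned / total
```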

Source: https://openai.com/index/introducing-indqa/

Construction Process And Adversarial Filtering

OpenAI describes a four step construction pipeline:

First, they partnered with organizations in India to recruit experts across 10 domains. These experts are native-level speakers of the target language and English and have deep subject expertise. They wrote difficult, reasoning-heavy prompts anchored in regional context, such as literature, food history, law or media.

Second, they applied adversarial filtering. Every draft question was evaluated against OpenAI’s strongest models at creation time: GPT-4o, OpenAI o3, GPT-4.5 and, partially after public launch, GPT-5. Only questions where a majority of these models failed to produce acceptable answers were kept. This preserves headroom so that future model improvements show up clearly on IndQA.
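
A minimal sketch of that majority-failure rule is shown below, assuming a hypothetical answers_acceptably(model, question) check that wraps whatever internal grading OpenAI used.

```python
def keep_question(question: dict, reference_models: list[str],
                  answers_acceptably) -> bool:
    """Adversarial filter: keep a draft question only if a majority of the
    strongest reference models fail to produce an acceptable answer."""
    failures = sum(
        1 for model in reference_models
        if not answers_acceptably(model, question)
    )
    return failures > len(reference_models) / 2
```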

Third, experts provided detailed criteria for grading each question, similar to an exam rubric. These criteria are reused whenever another model is evaluated on IndQA.

Fourth, experts wrote ideal answers and English translations and then performed peer review and iterative revisions until they signed off on quality.

Measuring Progress On Indian Languages

OpenAI uses IndQA to evaluate recent frontier models and to chart progress on Indian languages over the last couple of years. They report that model performance has improved significantly on IndQA while still leaving substantial room for improvement. Results are stratified by language and by domain and include comparisons of GPT-5 Thinking High with other frontier systems.
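
A stratified view of that kind can be produced by averaging per-question scores within each language or domain. The sketch below assumes hypothetical result records carrying language, domain and score fields, mirroring the illustrative datapoint structure above.

```python
from collections import defaultdict

def stratify(results: list[dict], key: str) -> dict[str, float]:
    """Average per-question scores grouped by `key`, e.g. 'language' or 'domain'."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for record in results:
        buckets[record[key]].append(record["score"])
    return {group: sum(scores) / len(scores) for group, scores in buckets.items()}
```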

Key Takeaways

  1. IndQA is a culturally grounded Indic benchmark: IndQA evaluates how well AI models understand and reason about questions that matter in Indian languages, across culturally specific domains, rather than only testing translation or multiple-choice accuracy.
  2. The dataset is expert built and reasonably large: The benchmark contains 2,278 questions across 12 languages and 10 cultural domains, developed in collaboration with 261 domain experts from across India, covering areas like architecture, everyday life, food, history and religion.
  3. Evaluation is rubric-based, not exact match: Each datapoint bundles a native-language prompt, an English translation, a detailed grading rubric and an ideal answer, and model outputs are graded by a model-based system that checks weighted, expert-defined criteria, which enables partial credit and nuanced cultural evaluation.
  4. Questions are adversarially filtered against OpenAI’s strongest models: Draft questions were filtered by running GPT-4o, OpenAI o3, GPT-4.5 and, partially, GPT-5, and keeping only those items where most of these models failed, which preserves headroom for future models on IndQA.

IndQA is a timely step because it targets a real gap: most existing multilingual benchmarks over-index on English content and translation-style tasks, while India has diverse high-resource and low-resource languages. IndQA brings expert-curated, rubric-based evaluation for questions that matter in Indian cultural contexts, and uses adversarial filtering against GPT-4o, OpenAI o3, GPT-4.5 and GPT-5 to preserve headroom for frontier models. This launch makes IndQA a practical north star for evaluating Indian-language reasoning in modern AI systems.


Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.


