STEM Dataset

Bilingual Multimodal STEM Dataset: a curated collection of 500 Math and Physics questions in Malay and English, some accompanied by relevant images.

Problem Statement

AI models often struggle with bilingual and multimodal STEM tasks because high-quality, domain-specific datasets covering both Malay and English are scarce.

Solution

We created a curated dataset of 500 Math and Physics questions in Malay and English, complemented by a public leaderboard to benchmark AI model performance.

Result

AI teams now have a reliable resource for fine-tuning and evaluating models on real-world STEM tasks in both Malay and English.

Overview

A Bilingual Dataset for Evaluating Reasoning Skills in STEM Subjects

This dataset provides a comprehensive evaluation set for tasks assessing reasoning skills in Science, Technology, Engineering, and Mathematics (STEM) subjects. It features questions in both English and Malay, catering to a diverse audience.

Key Features

Bilingual: Questions are available in English and Malay, promoting accessibility for multilingual learners.
Visually Rich: Some questions are accompanied by figures that support visual and contextual reasoning.
Focus on Reasoning: The dataset emphasizes questions that require logical reasoning and problem solving rather than simple recall of knowledge.
Real-World Context: Questions are drawn from real-world sources, such as past SPM (Sijil Pelajaran Malaysia) examinations, making them relevant to students.

Dataset Structure

The dataset comprises two configurations: data_en (English) and data_ms (Malay). Both configurations share the same features and structure.
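
As a rough sketch of how the two configurations might be selected by language, the snippet below maps a language code to a configuration name. The helper names `config_for` and `load_stem`, the `repo_id` argument, and the use of the Hugging Face `datasets` library are assumptions made for illustration; the text here does not say where or how the dataset is hosted.

```python
# Illustrative only: maps a language code to one of the two configurations
# named above. Helper names and the hosting library are assumptions.
CONFIGS = {"en": "data_en", "ms": "data_ms"}


def config_for(lang: str) -> str:
    """Return the configuration name for a language code ("en" or "ms")."""
    key = lang.lower()
    if key not in CONFIGS:
        raise ValueError(
            f"unsupported language {lang!r}; expected one of {sorted(CONFIGS)}"
        )
    return CONFIGS[key]


def load_stem(repo_id: str, lang: str):
    """Load one configuration, assuming a Hugging Face-hosted dataset.

    `repo_id` is a placeholder supplied by the caller; it is not given
    in the text.
    """
    # Imported lazily so that config_for() works without the dependency.
    from datasets import load_dataset

    return load_dataset(repo_id, name=config_for(lang))
```

Keeping the language-to-configuration mapping in one place means that code evaluating both languages can loop over `CONFIGS` rather than repeating the configuration names.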


Data Fields

FileName: Unique identifier for the source file (alphanumeric).
IBSN: International Standard Book Number (ISBN) of the source book (if available).
Subject: Academic subject (e.g., Physics, Mathematics).
Topic: Specific topic of the question within the subject (may be missing).
Questions: Main body of the question or problem statement.
Figures: List of associated image files related to the question (empty if no figures are present).
Label: Original caption or description of each image in the Figures list.
Options: Possible answer choices for the question, with keys (e.g., "A", "B", "C", "D") and corresponding text.
Answers: Correct answer to the question, represented by the key of the correct option (e.g., "C").
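
The fields above can be pictured as a single record type. The following is a minimal sketch, not the dataset's actual schema: the Python types, the `StemQuestion`, `validate`, and `accuracy` names, and the assumption that Label holds one entry per figure are all illustrative guesses based on the field descriptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class StemQuestion:
    """One record, using the field names listed above. Types are assumed."""
    FileName: str
    Subject: str
    Questions: str
    Options: Dict[str, str]
    Answers: str
    IBSN: Optional[str] = None   # only present if available
    Topic: Optional[str] = None  # may be missing
    Figures: List[str] = field(default_factory=list)  # empty if no figures
    Label: List[str] = field(default_factory=list)


def validate(q: StemQuestion) -> List[str]:
    """Return a list of problems found in a record (empty if none)."""
    problems = []
    if q.Answers not in q.Options:
        problems.append(f"answer key {q.Answers!r} is not among the options")
    # Assumption: one Label entry per entry in Figures.
    if len(q.Label) != len(q.Figures):
        problems.append("number of labels does not match number of figures")
    return problems


def accuracy(questions: List[StemQuestion], predictions: List[str]) -> float:
    """Fraction of predicted option keys that match the Answers field."""
    if len(questions) != len(predictions):
        raise ValueError("questions and predictions must have the same length")
    if not questions:
        return 0.0
    correct = sum(
        p.strip().upper() == q.Answers.strip().upper()
        for q, p in zip(questions, predictions)
    )
    return correct / len(questions)
```

Because Answers stores the key of the correct option rather than its text, scoring a model reduces to comparing option keys, which is what `accuracy` does here.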

