Technology is only as inclusive as the languages it understands. For many years, several African languages have remained underrepresented in the datasets that power artificial intelligence systems. This gap has real consequences. Voice technologies fail to recognize local accents, translation tools overlook indigenous languages, and entire communities are excluded from the digital economy. Through deliberate research and innovation, Kabarak University is helping to change that narrative.
The University continues to strengthen its leadership in linguistic technology by contributing high-quality speech and text datasets for underrepresented Kenyan languages. These initiatives are positioning the institution at the forefront of inclusive AI development while supporting the preservation and growth of indigenous languages in the digital space.
The projects were led by Dr. Andrew Kipkebut, a lecturer in the School of Science, Engineering and Technology and a passionate advocate for artificial intelligence research. He coordinated the collection of Kalenjin speech data and supported the development of parallel text translation datasets. His leadership ensured that the data collected met strong academic standards while remaining culturally accurate and ethically sourced. This balance of technical precision and community engagement has been central to the success of the work.
One major contribution is to the African Next Voices (ANV) multilingual speech dataset, hosted on Hugging Face, for which Kabarak University played a key role in the Kenyan pilot data collection. The dataset includes speech samples in Dholuo, Kikuyu, Somali, Kalenjin, and Maasai. It features both scripted dialogues and spontaneous speech drawn from real-life contexts such as healthcare, education, agriculture, and customer service. This variety is essential for training automatic speech recognition (ASR) systems and for supporting broader research in speech processing and multilingual natural language processing. The dataset is publicly accessible here: http://huggingface.co/datasets/MCAA1-MSU/anv_data_ke
In addition to speech data, the University supported the development of the Kenyan Low-Resource Language Text Dataset. This resource contains parallel sentence translations between Kiswahili and three indigenous languages: Kidaw’ida, Kalenjin, and Dholuo. Parallel datasets are critical for machine translation because they pair each sentence with its equivalent in another language, allowing AI models to learn how words and meaning correspond across languages. Such models can eventually power translation services, educational platforms, and communication applications that better serve Kenyan communities. The dataset can be accessed here: http://huggingface.co/datasets/thinkKenya/kenyan-low-resource-language-data
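For readers who want to see what "parallel" means in practice, the short Python sketch below shows one way aligned sentences might be organised into training pairs for a translation model. It is an illustration only, not the project's actual pipeline: the `ParallelPair` class, the `build_pairs` helper, and the placeholder sentences are all hypothetical, and the real dataset's file layout and column names are not described in this article. The language codes `swa` (Kiswahili) and `luo` (Dholuo) are the standard ISO 639-3 codes.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ParallelPair:
    """One aligned sentence pair (hypothetical structure, for illustration)."""
    source_lang: str
    target_lang: str
    source_text: str
    target_text: str


def build_pairs(source_sentences: List[str],
                target_sentences: List[str],
                source_lang: str,
                target_lang: str) -> List[ParallelPair]:
    """Align two equally long sentence lists into translation pairs.

    Sentence i in the source list is assumed to be the translation of
    sentence i in the target list. Pairs with an empty side are skipped.
    """
    if len(source_sentences) != len(target_sentences):
        raise ValueError("Source and target lists must have the same length.")

    pairs = []
    for src, tgt in zip(source_sentences, target_sentences):
        src, tgt = src.strip(), tgt.strip()
        if src and tgt:  # drop pairs where either side is blank
            pairs.append(ParallelPair(source_lang, target_lang, src, tgt))
    return pairs


if __name__ == "__main__":
    # Placeholder strings only: these are NOT real sentences from the dataset.
    kiswahili = ["<Kiswahili sentence 1>", "<Kiswahili sentence 2>"]
    dholuo = ["<Dholuo translation 1>", "<Dholuo translation 2>"]

    for pair in build_pairs(kiswahili, dholuo, "swa", "luo"):
        print(f"{pair.source_lang}: {pair.source_text} -> "
              f"{pair.target_lang}: {pair.target_text}")
```

A translation model trained on many such pairs sees the same meaning expressed in both languages, which is why carefully aligned data of this kind matters so much for low-resource languages.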
The combined impact of these datasets extends beyond academic research. They enable the development of speech and translation AI models for low-resource languages, support the creation of digital tools tailored to local communities, and contribute to multilingual research that preserves linguistic heritage. In practical terms, this work opens the door to more accessible information, improved digital services, and technology that reflects Kenya’s linguistic diversity.
Equally important is the ethical framework behind the projects. Both datasets were developed through community engagement, informed speaker consent, and respect for cultural context. This approach reflects the University’s commitment to research excellence, innovation, and social responsibility. The goal is not simply to build technology, but to do so in a way that respects and empowers the communities represented in the data.
Both datasets are publicly available under the Creative Commons Attribution 4.0 (CC BY 4.0) license, which allows researchers and developers worldwide to access, analyze, and build upon them, provided they give appropriate credit. By making these resources open, Kabarak University is encouraging global collaboration while reinforcing its position as a leader in Kenyan language technology.
As artificial intelligence continues to shape the future of communication and innovation, inclusion will define its true impact. Through visionary leadership, rigorous research, and ethical collaboration, Kabarak University is ensuring that Kenyan languages are not left behind in the digital age.


