IMPLEMENTATION OF THE FORCED ALIGNMENT ALGORITHM FOR AUDIO AND TEXT ALIGNMENT ON THE WEBSITE FOR THE MEANING OF THE JURUMIYAH BOOK

Authors

  • Yoga Ari Cahyadi, Universitas Sains Al-Qur'an
  • Hidayatus Sibyan, Universitas Sains Al-Qur'an
  • Nur Hasanah, Universitas Sains Al-Qur'an

DOI:

https://doi.org/10.58641/cest.v4i2.206

Keywords:

Prealignment, Jurumiyah Book, Forced Alignment, Wav2Vec2, CTC Segmentation

Abstract

Learning basic nahwu while studying the yellow books (classical Islamic texts) still faces various challenges, especially in interpreting the books objectively and interactively. This study aims to build an automatic book-interpretation system that uses forced alignment based on Wav2Vec2 with CTC segmentation to align audio and text. The system is designed to provide automatic interpretation with aligned audio and text, helping students prepare to interpret the books and learn nahwu, particularly the Jurumiyah book. The implementation involves extracting and pre-processing the audio and text data. The audio and text are then aligned: Wav2Vec2 produces logits covering the samples, frames, and character tokens; CTC segmentation takes these logits to compute the alignment, handle blank tokens, calculate sequence probabilities, and decode back to text, yielding an array of timestamps. The timestamps are then validated and normalized, and the final result is exported as TextGrid or JSON and integrated into an interactive website interface. The results indicate that the forced alignment algorithm using the Wav2Vec2 model can align audio and text with reasonably high accuracy, making it easier for users to understand the contents of the book through audio playback segmented per sentence or chapter. It is hoped that this research contributes to the development of alignment-based learning media for the yellow books of Islamic boarding schools.
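The CTC-segmentation step described above can be illustrated with a minimal sketch. This is not the authors' implementation: it uses synthetic per-frame log-probabilities in place of real Wav2Vec2 logits, a toy three-character vocabulary (the names `CHARS`, `FRAME_SEC`, and the token ids are illustrative), and the standard trellis-plus-backtracking formulation of CTC forced alignment to turn frame-level emissions into character timestamps serialized as JSON, as in the paper's output stage.

```python
import json
import numpy as np

# Toy vocabulary: index 0 is the CTC blank; 1..3 stand in for character
# tokens. All names here are illustrative, not from the paper.
BLANK = 0
CHARS = {1: "a", 2: "b", 3: "c"}
FRAME_SEC = 0.02  # Wav2Vec2 emits roughly one frame per 20 ms of 16 kHz audio

def get_trellis(emission, tokens, blank=BLANK):
    """trellis[t, j]: best cumulative log-score of having consumed
    tokens[:j] after t frames. `emission` is a (T, V) array of per-frame
    log-probabilities (log-softmaxed Wav2Vec2 logits in a real pipeline)."""
    T = emission.shape[0]
    N = len(tokens)
    trellis = np.full((T + 1, N + 1), -np.inf)
    trellis[0, 0] = 0.0
    trellis[1:, 0] = np.cumsum(emission[:, blank])  # all-blank prefix
    for t in range(T):
        for j in range(N):
            trellis[t + 1, j + 1] = max(
                trellis[t, j + 1] + emission[t, blank],  # emit blank, stay
                trellis[t, j] + emission[t, tokens[j]],  # emit token j, advance
            )
    return trellis

def backtrack(trellis, emission, tokens, blank=BLANK):
    """Walk the trellis backwards; return (frame, token_index) pairs
    marking the frame at which each transcript token is emitted."""
    t, j = trellis.shape[0] - 1, trellis.shape[1] - 1
    path = []
    while t > 0 and j > 0:
        stay = trellis[t - 1, j] + emission[t - 1, blank]
        change = trellis[t - 1, j - 1] + emission[t - 1, tokens[j - 1]]
        if change > stay:
            path.append((t - 1, j - 1))
            j -= 1
        t -= 1
    return path[::-1]

# Synthetic emissions for 9 frames: each token "speaks" for two frames,
# separated by blank frames. Dominant symbol gets log-prob 0, others -5.
tokens = [1, 2, 3]
emission = np.full((9, 4), -5.0)
for frame, sym in [(0, BLANK), (1, 1), (2, 1), (3, BLANK), (4, 2),
                   (5, 2), (6, BLANK), (7, 3), (8, 3)]:
    emission[frame, sym] = 0.0

path = backtrack(get_trellis(emission, tokens), emission, tokens)
segments = [{"char": CHARS[tokens[j]],
             "start": round(t * FRAME_SEC, 2),
             "end": round((t + 1) * FRAME_SEC, 2)} for t, j in path]
print(json.dumps(segments))
```

In the paper's pipeline, per-character timestamps like these would be grouped into sentence or chapter segments, validated, normalized, and written out as TextGrid or JSON for the website player.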

References

Abdulrahman, A. O. (2026). Pitch-aware multi-feature fusion for classifying statements, questions, and exclamations in low-resource languages. Computer Speech & Language, 99, 101941. https://doi.org/10.1016/j.csl.2026.101941

Arifin, A. F., & Hajja Ristianti, D. (2022). Metode sorogan dalam meningkatkan minat dan keterampilan membaca Kitab Kuning Santri Al-Afiyah Bogor Jawa Barat. Inspiratif Pendidikan, 11(1), 24–36. https://doi.org/10.24252/ip.v11i1.29195

Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020). Wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems (NeurIPS), 33, 12449–12460. https://arxiv.org/abs/2006.11477

Bootkrajang, J., Inkeaw, P., Chaijaruwanich, J., Taerungruang, S., Boonyawisit, A., Sutawong, B. J. M., Chunwijitra, V., & Taninpong, P. (2026). The development of Northern Thai dialect speech recognition system. Applied Sciences, 16(1), 160. https://doi.org/10.3390/app16010160

Creswell, J. W. (2014). Research design: Qualitative, quantitative, and mixed methods approaches (4th ed.). Sage Publications.

Dar, M. A., & Pushparaj, J. (2026). A Wav2Vec2 model-based automatic speech recognition system for low-resource Kashmiri language. International Journal of Speech Technology, 29(1), 2. https://doi.org/10.1007/s10772-025-10228-7

Febrian, N., Purwanto, P., Syarifah, L., & Muna, N. (2024). Efektivitas metode pembelajaran Sorogan Kitab Jurumiyah di Pondok Pesantren Putri Al Ma’rufiyah Tempuran. DWIJA CENDEKIA: Jurnal Riset Pedagogik, 8(1), 83. https://doi.org/10.20961/jdc.v8i1.84564

Hu, H., Tang, C., Tan, P., & Xu, H. (2026). A CTC-based speech recognition network fusing local convolution and global attention. Sensors, 26(6), 1865. https://doi.org/10.3390/s26061865

McLoughlin, I., Pham, L., Song, Y., Miao, X. X., Phan, H., Cai, P., Gu, Q., Nan, J., Song, H., & Soh, D. (2026). Spectrogram features for audio and speech analysis. Applied Sciences, 16(2), 572. https://doi.org/10.3390/app16020572

Murtafiah, N. H. (2021). Efektivitas penerapan metode sorogan Kitab Al Jurumiyah dalam meningkatkan kemampuan membaca Kitab Kuning. An Nida, 1(1), 18–25.

Nijat, M., Wei, Y., & Hamdulla, A. (2026). Perception norm for mispronunciation detection. Applied Sciences, 16(7), 3311. https://doi.org/10.3390/app16073311

Ollmann, P., Sonnleitner, E., Kurz, M., Krösche, J., & Selinger, S. (2026). Listen closely: Self-supervised phoneme tracking for children’s reading assessment. Information, 17(1), 40. https://doi.org/10.3390/info17010040

Poncelet, J., & Van Hamme, H. (2026). Leveraging broadcast media subtitle transcripts for automatic speech recognition and subtitling. Journal on Audio, Speech, and Music Processing, 2026, 20. https://doi.org/10.1186/s13636-026-00450-9

Pressman, R. S. (2010). Software engineering: A practitioner’s approach (7th ed.). McGraw-Hill.

Rahmawati, I., & Negara, T. D. W. (2021). Pelatihan Arab Pegon bagi santri baru guna meningkatkan kualitas pembelajaran Kitab Kuning di Pondok Pesantren Darul Huda Putri. MA’ALIM: Jurnal Pendidikan Islam, 2(02), 103–112. https://doi.org/10.21154/maalim.v2i2.3177

Sholikhun, M. (2018). Pembentukan karakter siswa dengan sistem boarding school. Wahana Islamika: Jurnal Studi Keislaman, 4(1), 48–64.

Spradley, J. (2006). Participant observation. Waveland Press.

Sugiyono. (2018). Metode penelitian kuantitatif, kualitatif, dan R&D. Alfabeta.

Ungureanu, R. D., & Dascalu, M. (2026). Modern speech recognition for Romanian language. Applied Sciences, 16(4), 1928. https://doi.org/10.3390/app16041928

Vander Eeckt, S., & Van Hamme, H. (2026). Efficient rehearsal for continual learning in ASR via singular value tuning. IEEE Transactions on Audio, Speech and Language Processing, 34, 978–991. https://doi.org/10.1109/TASLPRO.2026.3658931

Wahid, A. (2020). Rekayasa perangkat lunak dan model waterfall. Deepublish.

Xie, Y., Zhong, H., Lan, X., & Dong, W. (2026). Mispronunciation detection and diagnosis based on large language models. Computer Speech & Language, 99, 101942. https://doi.org/10.1016/j.csl.2026.101942

Published

2026-04-30

Section

Articles