IMPLEMENTATION OF THE FORCED ALIGNMENT ALGORITHM FOR AUDIO AND TEXT ALIGNMENT ON THE WEBSITE FOR THE MEANING OF THE JURUMIYAH BOOK
DOI: https://doi.org/10.58641/cest.v4i2.206

Keywords: Prealignment, Jurumiyah Book, Forced Alignment, Wav2Vec2, CTC Segmentation

Abstract
Learning basic nahwu (Arabic grammar) through the study of classical Islamic texts (kitab kuning, "yellow books") still faces various challenges, especially in interpreting the books objectively and interactively. This study builds an automatic book-interpretation system that uses forced alignment, based on Wav2Vec2 and CTC segmentation, to align audio with text. The system is designed to provide automatic interpretation with synchronized audio and text, helping students prepare for interpreting the books and learning nahwu, particularly the Jurumiyah book. The implementation begins with extraction and pre-processing of the audio and text data. The audio is then passed through Wav2Vec2, which produces a logits output over samples, frames, and character tokens. The logits are fed to the CTC segmentation stage, which computes the alignment, handles blank tokens, calculates sequence probabilities, and decodes to text to produce an array of timestamps. The timestamps are then validated and normalized, and the final result is exported as TextGrid or JSON and integrated into an interactive website interface. The results of this study indicate that the forced alignment algorithm using the Wav2Vec2 model can align audio and text with a fairly high level of accuracy, making it easier for users to follow the contents of the book through audio playback segmented per sentence or chapter. This research is expected to contribute to the development of alignment-based learning media for the yellow books of Islamic boarding schools.
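The CTC alignment step described in the abstract — expanding the target text with blank tokens, scoring the sequence against frame-level probabilities, and collapsing the best path into per-token timestamps — can be sketched in pure Python. This is a minimal illustration with toy emissions, not the paper's implementation: a real pipeline would take the logits from a Wav2Vec2 model and typically use a library such as ctc-segmentation; the vocabulary, probabilities, and the ~20 ms frame duration (Wav2Vec2's stride of 320 samples at 16 kHz) are assumptions for the example.

```python
import math

def ctc_forced_align(log_probs, tokens, blank=0):
    """Viterbi forced alignment of a known token sequence against
    frame-level CTC log-probabilities (T frames x V tokens).
    Returns the blank-expanded target and the best state per frame."""
    ext = [blank]
    for tok in tokens:
        ext += [tok, blank]                     # expanded target: ^ c1 ^ c2 ^ ...
    S, T = len(ext), len(log_probs)
    dp = [[-math.inf] * S for _ in range(T)]
    bp = [[0] * S for _ in range(T)]
    dp[0][0] = log_probs[0][ext[0]]             # may start at the leading blank
    dp[0][1] = log_probs[0][ext[1]]             # ... or at the first token
    for t in range(1, T):
        for s in range(S):
            cands = [(dp[t - 1][s], s)]                    # stay in state
            if s > 0:
                cands.append((dp[t - 1][s - 1], s - 1))    # advance one state
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append((dp[t - 1][s - 2], s - 2))    # skip over a blank
            score, prev = max(cands)
            dp[t][s] = score + log_probs[t][ext[s]]
            bp[t][s] = prev
    # The path may end on the final blank or the final token.
    s = S - 1 if dp[T - 1][S - 1] >= dp[T - 1][S - 2] else S - 2
    path = [s]
    for t in range(T - 1, 0, -1):
        s = bp[t][s]
        path.append(s)
    path.reverse()
    return ext, path

def token_timestamps(ext, path, frame_sec, blank=0):
    """Collapse the frame-level path into (token, start_s, end_s) spans."""
    spans = {}
    for t, s in enumerate(path):
        if ext[s] != blank:
            lo, hi = spans.get(s, (t, t))
            spans[s] = (min(lo, t), max(hi, t + 1))
    return [(ext[s], lo * frame_sec, hi * frame_sec)
            for s, (lo, hi) in sorted(spans.items())]

# Toy emissions: 6 frames over a 3-token vocabulary {0: blank, 1: 'a', 2: 'b'}.
lp = math.log
logits = [[lp(.1), lp(.8), lp(.1)],
          [lp(.1), lp(.8), lp(.1)],
          [lp(.6), lp(.2), lp(.2)],
          [lp(.1), lp(.1), lp(.8)],
          [lp(.1), lp(.1), lp(.8)],
          [lp(.8), lp(.1), lp(.1)]]
ext, path = ctc_forced_align(logits, tokens=[1, 2])
stamps = token_timestamps(ext, path, frame_sec=0.02)  # assumed 20 ms frames
print(stamps)
```

In a full system these per-character spans would be merged up to word or sentence level and serialized as the TextGrid or JSON that the website interface consumes.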
License
Copyright (c) 2026 Clean Energy and Smart Technology

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.