Advancing Roman Urdu to Urdu Transliteration using Machine Learning Techniques

Authors

  • Ahsan Ahmad, Senior Machine Learning Engineer
  • Mohsin Ali Ahmad, Student

Keywords:

Roman Urdu, Romanization, Urdu script, Transcription, Phonetic conversion, Roman Urdu to Urdu mapping, Script conversion, Romanized Urdu

Abstract

Roman Urdu is widely used for communication on social media platforms and in daily messaging, especially in Pakistan, and as a result many of its users find writing in Urdu script difficult. In this paper, we evaluate several models and compare their results on a dataset of approximately 6.5 million sentences. Finally, we propose modifications to the architecture of the Transformer model, which gave the best results, to improve the BLEU score of Roman Urdu to Urdu transliteration. These changes aim to increase the generalization and accuracy of the transliteration, so that a sentence is transliterated according to its context.

Published

22-04-2024

How to Cite

Ahmad, Ahsan, and Mohsin Ali Ahmad. “Advancing Roman Urdu to Urdu Transliteration Using Machine Learning Techniques”. Asian Journal of Multidisciplinary Research & Review, vol. 5, no. 2, Apr. 2024, pp. 108-27, https://ajmrr.org/journal/article/view/9.