Advancing Roman Urdu to Urdu Transliteration using Machine Learning Techniques
Keywords:
Roman Urdu, Romanization, Urdu script, Transcription, Phonetic conversion, Roman Urdu to Urdu mapping, Script conversion, Romanized Urdu
Abstract
Roman Urdu is widely used for communication on social media platforms and in daily messaging, especially in Pakistan, and as a result many users find it difficult to write in Urdu script. In this paper, we investigate different models and compare their results on a dataset of approximately 6.5 million sentences. Finally, we propose modifications to the architecture of the Transformer model, which gave the best results, to improve the BLEU score of Roman Urdu to Urdu transliteration. These modifications aim to increase the generalization and accuracy of the transliteration and to enable each sentence to be transliterated according to its context.
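Since the abstract reports results in terms of BLEU, the following minimal sketch shows how a sentence-level BLEU score can be computed for a transliterated sentence. It is an illustration only, not the evaluation setup used in the paper: the function names `ngrams` and `sentence_bleu`, the whitespace tokenization, the single reference, and the absence of smoothing are all assumptions made here.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(reference, hypothesis, max_n=4):
    """BLEU for one hypothesis against one reference (both token lists).

    Uses clipped n-gram precision, a uniform geometric mean over orders
    1..max_n, and the standard brevity penalty. No smoothing is applied
    (an assumption of this sketch), so any order with zero matches
    yields a score of 0.
    """
    if not hypothesis:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hypothesis, n)
        ref_counts = ngrams(reference, n)
        total = sum(hyp_counts.values())
        # Clip each hypothesis n-gram count by its count in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        if total == 0 or matched == 0:
            return 0.0
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: penalize hypotheses shorter than the reference.
    if len(hypothesis) >= len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(hypothesis))
    return bp * math.exp(sum(log_precisions) / max_n)


# Hypothetical example: a reference Urdu sentence and a model output
# that differs in one word, both split on whitespace.
reference = "میں آج اسکول جا رہا ہوں".split()
hypothesis = "میں آج سکول جا رہا ہوں".split()
print(sentence_bleu(reference, reference))
print(sentence_bleu(reference, hypothesis))
```

On short sentences a single wrong word can remove every matching 4-gram, which drives the unsmoothed score to zero. For that reason BLEU is usually reported at corpus level or with smoothing; which variant the paper uses is not stated in the abstract.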
License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
License Terms
Ownership and Licensing:
Authors of research papers submitted to the Asian Journal of Multidisciplinary Research & Review (AJMRR) retain the copyright of their work while granting the journal certain rights. Authors maintain ownership of the copyright and grant the journal a right of first publication. Simultaneously, authors agree to license their research papers under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) License.
License Permissions:
Under the CC BY-SA 4.0 License, others are permitted to share and adapt the work, even for commercial purposes, as long as proper attribution is given to the authors and acknowledgment is made of the initial publication in the Asian Journal of Multidisciplinary Research & Review. This license allows for the broad dissemination and utilization of research papers.
Additional Distribution Arrangements:
Authors are free to enter into separate contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., posting it to institutional repositories or publishing it in books), provided they acknowledge the initial publication of the work in the Asian Journal of Multidisciplinary Research & Review.
Online Posting:
Authors are encouraged to share their work online (e.g., in institutional repositories or on personal websites) both prior to and during the submission process to the journal. This practice can lead to productive exchanges and greater citation of published work.
Responsibility and Liability:
Authors are responsible for ensuring that their research papers do not infringe upon the copyright, privacy, or other rights of any third party. The Asian Journal of Multidisciplinary Research & Review disclaims any liability or responsibility for any copyright infringement or violation of third-party rights in the research papers.