Analysis of Data Model to Distinguish DGA and Legitimate Domains from Malicious Person Accessing the Internet
Abstract
This research proposes an approach that applies supervised machine learning to classify domain names as either legitimate or generated by Domain Generation Algorithms (DGAs). The attributes used in the analysis were defined manually: for instance, the length of a domain name, the number of letters and digits it contains, the number of meaningful words in it, and the number of pronounceable words in it. The dataset consists of 50,000 legitimate domain names and 50,000 DGA-generated domain names, split into 70% for training and 30% for testing and then fed into a classification and regression tree (CART) model implemented with Python and its libraries. After the model was improved with pre-pruning and evaluated with a confusion matrix, the decision tree classified legitimate and DGA-generated domain names more efficiently, achieving an accuracy of 97.25%, a precision of 96.25%, a recall of 97.25%, and an F1 score of 96.75%.
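The lexical attributes, the 70/30 split, and the confusion-matrix metrics described in the abstract can be illustrated with a minimal sketch that uses only the Python standard library. This is not the authors' implementation. The function names, the tiny word list `WORDS`, and the substring-based count of meaningful words are assumptions made for illustration, and the pronounceability attribute and the CART training step are not shown.

```python
import random

# Hypothetical mini word list (assumption); the study's dictionary is not given in the abstract.
WORDS = {"face", "book", "mail", "news", "shop", "cloud"}


def extract_features(label: str) -> dict:
    """Compute simple lexical attributes for one domain label (without the TLD)."""
    label = label.lower()
    # Count dictionary words occurring as substrings: a crude stand-in for word segmentation.
    meaningful = sum(1 for word in WORDS if word in label)
    return {
        "length": len(label),
        "letters": sum(ch.isalpha() for ch in label),
        "digits": sum(ch.isdigit() for ch in label),
        "meaningful_words": meaningful,
    }


def split_70_30(rows: list, seed: int = 0) -> tuple:
    """Shuffle a copy of rows and split it into 70% training and 30% testing parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 7 // 10  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


def confusion_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall, and F1 from binary confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }
```

Since F1 is the harmonic mean of precision and recall, the reported F1 score of 96.75% is consistent with the reported precision of 96.25% and recall of 97.25%.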
Article Details

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Articles in this journal are copyrighted by the x and may be read and used for academic purposes, such as teaching, research, or citation, with proper credit given to the author and the journal. Use or modification of the articles is prohibited without permission.
Statements expressed in the articles are solely the opinions of the authors.
Authors are fully responsible for the content and accuracy of their articles.
Other reuse or republication requires permission from the journal.