Analysis of Data Model to Distinguish DGA and Legitimate Domains from Malicious Person Accessing the Internet


Sugrid Sorrahong
Auttapon Pomsathit
Boonruang Kerdaroondej

Abstract

This research proposes an approach that applies supervised machine learning to classify legitimate domain names and domain names generated by Domain Generation Algorithms (DGAs). The attributes used in the analysis were defined by human judgment: for instance, the length of a domain name, the number of letters and digits it contains, the number of meaningful words in it, and the number of pronounceable words in it. The dataset consists of 50,000 legitimate domain names and 50,000 DGA domain names, split into a 70% training set and a 30% testing set before being fed into a classification and regression tree (CART) model implemented with Python and its libraries. After the model was tuned with pre-pruning and its performance was evaluated with a confusion matrix, the decision tree classified legitimate and DGA domain names more effectively, achieving an accuracy of 97.25%, a precision of 96.25%, a recall of 97.25%, and an F1 score of 96.75%.
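As an illustration only, the character-level attributes and the confusion-matrix metrics named in the abstract can be sketched in plain Python. This is not the authors' implementation: the function and class names are hypothetical, treating DGA as the positive class is an assumption, and the counts in the usage lines are toy values rather than results from the paper. The word-based attributes (meaningful and pronounceable words) and the CART training step are omitted because they depend on external libraries.

```python
from dataclasses import dataclass


def lexical_features(domain_label: str) -> dict:
    """Count simple character-level attributes of one domain label.

    Hypothetical helper: covers only length, letters, and digits.
    """
    label = domain_label.lower()
    return {
        "length": len(label),
        "letters": sum(ch.isalpha() for ch in label),
        "digits": sum(ch.isdigit() for ch in label),
    }


@dataclass
class ConfusionMatrix:
    # Assumption: DGA is the positive class, legitimate is the negative class.
    tp: int  # DGA predicted as DGA
    fp: int  # legitimate predicted as DGA
    fn: int  # DGA predicted as legitimate
    tn: int  # legitimate predicted as legitimate

    def accuracy(self) -> float:
        return (self.tp + self.tn) / (self.tp + self.fp + self.fn + self.tn)

    def precision(self) -> float:
        return self.tp / (self.tp + self.fp)

    def recall(self) -> float:
        return self.tp / (self.tp + self.fn)

    def f1(self) -> float:
        # Harmonic mean of precision and recall.
        p, r = self.precision(), self.recall()
        return 2 * p * r / (p + r)


def confusion_from_labels(y_true, y_pred, positive="dga") -> ConfusionMatrix:
    """Tally a 2x2 confusion matrix from paired true/predicted labels."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return ConfusionMatrix(tp=tp, fp=fp, fn=fn, tn=tn)


# Toy usage (values are illustrative, not taken from the study).
features = lexical_features("x9k2q7")
toy = ConfusionMatrix(tp=8, fp=2, fn=1, tn=9)
toy_scores = (toy.accuracy(), toy.precision(), toy.recall(), toy.f1())
```

The same harmonic-mean formula is consistent with the reported figures: 2 × 0.9625 × 0.9725 / (0.9625 + 0.9725) ≈ 0.9675, which matches the stated F1 score of 96.75%.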

Article Details

How to Cite
Sorrahong, S., Pomsathit, A., & Kerdaroondej, B. (2026). Analysis of Data Model to Distinguish DGA and Legitimate Domains from Malicious Person Accessing the Internet. The Golden Teak : Science and Technology Journal (GTSJ.), 9(2), 117–132. Retrieved from https://li02.tci-thaijo.org/index.php/gts/article/view/1944

Section

Research Article

References

Bader, J. (2015). Domain Generation Algorithms (DGAs) of Malware reimplemented in Python. [Online]. Available : https://github.com/baderj/domain_generation_algorithms [2021, October, 23].

Bird, S., Klein, E., & Loper, E. (2009). Natural Language Processing with Python. O'Reilly Media, Inc.

Breslow, L. A., & Aha, D. W. (1997). Simplifying decision trees: A survey. The Knowledge Engineering Review, 12(1), 1-40.

Chowdhury, S. A. (2019). Domain Generation Algorithm-Dga in Malware.

Esposito, F., Malerba, D., Semeraro, G., & Kay, J. (1997). A comparative analysis of methods for pruning decision trees. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(5), 476-491.

Esposito, F., Malerba, D., Semeraro, G., & Tamma, V. (1999). The effects of pruning methods on the predictive accuracy of induced decision trees. Applied Stochastic Models in Business and Industry, 15(4), 277-299.

G. P., A., R., G., S., K., & Gladston, A. (2020). A machine learning framework for domain generating algorithm based malware detection. Security and Privacy, 3(6), e127.

Hubbard, D. (2016). Cisco Umbrella 1 Million. [Online]. Available : https://umbrella.cisco.com/blog/cisco-umbrella-1-million [2021, April, 12].

Jenks, G. (2018). Python Word Segmentation. [Online]. Available : https://github.com/grantjenks/python-wordsegment.git [2021, May, 16].

Kurkowski, J. (2020). tldextract. [Online]. Available : https://github.com/john-kurkowski/tldextract.git [2021, June, 3].

Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and Regression Trees. [Online]. Available : https://doi.org/10.1201/9781315139470 [2021, May, 4].

Parrish, A. (2015, November). pronouncingpy. [Online]. Available : https://github.com/aparrish/pronouncingpy.git [2021, July, 15].

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825-2830.

European Union Agency for Cybersecurity (ENISA). (2020). List of top 15 threats. ENISA Threat Landscape. [Online]. Available : https://www.enisa.europa.eu/publications/enisa-threat-landscape-2020-list-of-top-15-threats/view/++widget++form.widgets.fullReport/@@download/ETL2020+-+ENISA+List+of+top+15+Threats+A4.pdf [2021, July, 9].