TY - JOUR
T1 - Rule-Based Information Extraction from Multi-format Resumes for Automated Classification
AU - Musleh, Dhiaa A.
N1 - Publisher Copyright:
© 2024 The author. This article is published by IIETA and is licensed under the CC BY 4.0 license (http://creativecommons.org/licenses/by/4.0/).
PY - 2024/4
Y1 - 2024/4
N2 - With the expansion of the Internet, many people publish their resumes online and on social media networks. Large companies receive hundreds of resumes per day, which come in several formats such as Joint Photographic Experts Group (JPG), Portable Document Format (PDF) and Word files. Information extraction from resumes can therefore be automated by several methods. In this research, the details extracted from resumes are: name, date of birth, email, phone number, GPA, gender, nationality, and address. The private resume dataset was drawn from several sources, including open-source collections and personally annotated resumes. Information extraction was performed in phases: pre-processing, converting the resume files into PDF, and applying a rule-based method to extract the eight elements from each resume. The experiment was carried out in Python, using the spaCy library and the word2vec technique. The experimental results show that the testing phase achieved 96.4% information extraction precision, which compares favorably with techniques in the literature. The scheme was then extended to classify resumes based on the extracted information fields, achieving classification accuracy, precision, recall and F1-score of 98.02%, 98.01%, 98% and 98%, respectively.
AB - With the expansion of the Internet, many people publish their resumes online and on social media networks. Large companies receive hundreds of resumes per day, which come in several formats such as Joint Photographic Experts Group (JPG), Portable Document Format (PDF) and Word files. Information extraction from resumes can therefore be automated by several methods. In this research, the details extracted from resumes are: name, date of birth, email, phone number, GPA, gender, nationality, and address. The private resume dataset was drawn from several sources, including open-source collections and personally annotated resumes. Information extraction was performed in phases: pre-processing, converting the resume files into PDF, and applying a rule-based method to extract the eight elements from each resume. The experiment was carried out in Python, using the spaCy library and the word2vec technique. The experimental results show that the testing phase achieved 96.4% information extraction precision, which compares favorably with techniques in the literature. The scheme was then extended to classify resumes based on the extracted information fields, achieving classification accuracy, precision, recall and F1-score of 98.02%, 98.01%, 98% and 98%, respectively.
KW - document classification
KW - information extraction
KW - NLP
KW - PDF resume
KW - Python language
KW - rule based system
KW - text and data mining
UR - https://www.scopus.com/pages/publications/85192390561
U2 - 10.18280/mmep.110422
DO - 10.18280/mmep.110422
M3 - Article
AN - SCOPUS:85192390561
SN - 2369-0739
VL - 11
SP - 1044
EP - 1052
JO - Mathematical Modelling of Engineering Problems
JF - Mathematical Modelling of Engineering Problems
IS - 4
ER -