Rule-Based Information Extraction from Multi-format Resumes for Automated Classification

Dhiaa A. Musleh

doi:10.18280/mmep.110422

Dhiaa A. Musleh (corresponding author)

Computer Science Department

Research output: Contribution to journal › Article › peer-review

2 Scopus citations

Abstract

With the expansion of the Internet, many people publish their resumes online and on social media networks. Large companies receive hundreds of resumes per day in several formats, such as Joint Photographic Experts Group (JPG) images, Portable Document Format (PDF) files, and Word documents. Information extraction from such resumes can therefore be automated, and several methods exist for doing so. This research extracts eight details from each resume: name, date of birth, email, phone number, GPA, gender, nationality, and address. The private resume dataset is drawn from different sources, including open-source data and personally annotated resumes. Information extraction proceeds in phases: pre-processing, conversion of the resume files into PDF, and rule-based extraction of the eight elements. The experiments are implemented in Python, using the spaCy library and the word2vec technique. In the testing phase, the method achieved an information extraction precision of 96.4%, which compares favorably with the techniques in the literature. The scheme is then extended to classify resumes based on the extracted fields, achieving classification accuracy, precision, recall, and F1-score of 98.02%, 98.01%, 98%, and 98%, respectively.
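The abstract does not list the actual extraction rules, so the following is only a minimal sketch of what rule-based field extraction can look like in Python. The regular expressions, the `extract_fields` function, and the sample resume text are all hypothetical and cover just four of the eight fields (email, phone number, GPA, gender); the paper's own rules, and its use of spaCy and word2vec, may differ substantially.

```python
import re

# Hypothetical rules for illustration only; they are not the rules used in the paper.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# A digit, then at least 7 digits/spaces/brackets/dots/hyphens, then a digit.
PHONE_RE = re.compile(r"\+?\d[\d ().-]{7,}\d")
# The label "GPA", an optional ':' or '=', then a number such as 3.7.
GPA_RE = re.compile(r"\bGPA\b\s*[:=]?\s*(\d(?:\.\d{1,2})?)", re.IGNORECASE)
# The label "gender" or "sex" followed by "male" or "female".
GENDER_RE = re.compile(r"\b(?:gender|sex)\b\s*[:\-]?\s*(male|female)\b", re.IGNORECASE)


def extract_fields(text):
    """Apply each rule to plain resume text; a field is None when no rule fires."""
    email = EMAIL_RE.search(text)
    phone = PHONE_RE.search(text)
    gpa = GPA_RE.search(text)
    gender = GENDER_RE.search(text)
    return {
        "email": email.group(0) if email else None,
        "phone": phone.group(0) if phone else None,
        "gpa": float(gpa.group(1)) if gpa else None,
        "gender": gender.group(1).lower() if gender else None,
    }


if __name__ == "__main__":
    # Made-up resume text, assumed to be already converted from PDF to plain text.
    sample = (
        "Name: Jane Doe\n"
        "Email: jane.doe@example.com\n"
        "Phone: +1 555-010-9999\n"
        "GPA: 3.7/4.0\n"
        "Gender: Female\n"
    )
    print(extract_fields(sample))
```

Rules of this kind are easy to inspect and adjust per field, but they depend on the resume text containing recognizable labels and formats. Fields without a fixed surface pattern, such as name, nationality, and address, would need other rules or a lookup resource.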

Original language	English
Pages (from-to)	1044-1052
Number of pages	9
Journal	Mathematical Modelling of Engineering Problems
Volume	11
Issue number	4
DOIs	https://doi.org/10.18280/mmep.110422
State	Published - Apr 2024

Keywords

document classification
information extraction
NLP
PDF resume
Python language
rule based system
text and data mining

Access to Document

10.18280/mmep.110422

Cite this

@article{f9262d0d3d4342b4bd614b2bcde8585a,
title = "Rule-Based Information Extraction from Multi-format Resumes for Automated Classification",
abstract = "With the expansion of the Internet, many people publish their resumes online and on social media networks. Large companies receive hundreds of resumes per day in several formats, such as Joint Photographic Experts Group (JPG) images, Portable Document Format (PDF) files, and Word documents. Information extraction from such resumes can therefore be automated, and several methods exist for doing so. This research extracts eight details from each resume: name, date of birth, email, phone number, GPA, gender, nationality, and address. The private resume dataset is drawn from different sources, including open-source data and personally annotated resumes. Information extraction proceeds in phases: pre-processing, conversion of the resume files into PDF, and rule-based extraction of the eight elements. The experiments are implemented in Python, using the spaCy library and the word2vec technique. In the testing phase, the method achieved an information extraction precision of 96.4\%, which compares favorably with the techniques in the literature. The scheme is then extended to classify resumes based on the extracted fields, achieving classification accuracy, precision, recall, and F1-score of 98.02\%, 98.01\%, 98\%, and 98\%, respectively.",
keywords = "document classification, information extraction, NLP, PDF resume, Python language, rule based system, text and data mining",
author = "Musleh, {Dhiaa A.}",
note = "Publisher Copyright: {\textcopyright} 2024 The author. This article is published by IIETA and is licensed under the CC BY 4.0 license (http://creativecommons.org/licenses/by/4.0/).",
year = "2024",
month = apr,
doi = "10.18280/mmep.110422",
language = "English",
volume = "11",
pages = "1044--1052",
journal = "Mathematical Modelling of Engineering Problems",
issn = "2369-0739",
number = "4",
}

TY  - JOUR
T1  - Rule-Based Information Extraction from Multi-format Resumes for Automated Classification
AU  - Musleh, Dhiaa A.
PY  - 2024/4
Y1  - 2024/4
N2  - With the expansion of the Internet, many people publish their resumes online and on social media networks. Large companies receive hundreds of resumes per day in several formats, such as Joint Photographic Experts Group (JPG) images, Portable Document Format (PDF) files, and Word documents. Information extraction from such resumes can therefore be automated, and several methods exist for doing so. This research extracts eight details from each resume: name, date of birth, email, phone number, GPA, gender, nationality, and address. The private resume dataset is drawn from different sources, including open-source data and personally annotated resumes. Information extraction proceeds in phases: pre-processing, conversion of the resume files into PDF, and rule-based extraction of the eight elements. The experiments are implemented in Python, using the spaCy library and the word2vec technique. In the testing phase, the method achieved an information extraction precision of 96.4%, which compares favorably with the techniques in the literature. The scheme is then extended to classify resumes based on the extracted fields, achieving classification accuracy, precision, recall, and F1-score of 98.02%, 98.01%, 98%, and 98%, respectively.
KW  - document classification
KW  - information extraction
KW  - NLP
KW  - PDF resume
KW  - Python language
KW  - rule based system
KW  - text and data mining
UR  - https://www.scopus.com/pages/publications/85192390561
U2  - 10.18280/mmep.110422
DO  - 10.18280/mmep.110422
M3  - Article
AN  - SCOPUS:85192390561
SN  - 2369-0739
VL  - 11
SP  - 1044
EP  - 1052
JO  - Mathematical Modelling of Engineering Problems
JF  - Mathematical Modelling of Engineering Problems
IS  - 4
ER  -