Agronomy 2022, 12, 2096
several larger datasets have subsequently emerged. Refs. [16,17,37,38] proposed several
datasets containing more than 4500 samples, with 100 samples per category. Ref. [20]
proposed an open-source dataset, IP102, containing 75,222 samples covering 102 classes of
common pests of field crops, and evaluated classification performance using both
hand-designed features (CH, LCH, Gabor, GIST, SIFT and SURF) and deep learning networks
(AlexNet, GoogLeNet, VGGNet-16 and ResNet-50); the networks were pre-trained on ImageNet
and then fine-tuned on the IP102 dataset. ResNet-50 achieved the best results on all
metrics, yet the large gap between its 49.4% accuracy and 31.5% G-mean shows how highly
imbalanced the IP102 dataset is. Moreover, a highest accuracy of only 49.4% indicates how
challenging the IP102 dataset is. Therefore, we decided to continue our research on the
imbalanced learning problem based on the IP102 classification system.
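To illustrate why a gap between accuracy and G-mean signals class imbalance, the following minimal sketch computes both metrics on a toy two-class example. It assumes the common definition of G-mean as the geometric mean of per-class recall; the function name `accuracy_and_gmean` and the toy labels are hypothetical and are not taken from Ref. [20].

```python
import math
from collections import Counter


def accuracy_and_gmean(y_true, y_pred):
    """Return (accuracy, G-mean) for two equal-length label sequences.

    Assumption: G-mean is the geometric mean of per-class recall,
    computed over the classes that occur in y_true.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    support = Counter(y_true)  # number of true samples per class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct predictions per class

    accuracy = sum(hits.values()) / len(y_true)
    recalls = [hits[c] / support[c] for c in support]
    gmean = math.prod(recalls) ** (1.0 / len(recalls))
    return accuracy, gmean


# Toy imbalanced data: 8 samples of class 0 and 2 samples of class 1.
y_true = [0] * 8 + [1] * 2
# The classifier gets every majority sample right but misses one minority sample.
y_pred = [0] * 8 + [0, 1]

acc, gm = accuracy_and_gmean(y_true, y_pred)
print(f"accuracy = {acc:.3f}, G-mean = {gm:.3f}")
```

In this toy case the majority class dominates accuracy, whereas the single missed minority sample halves that class's recall and pulls the G-mean down. This is the same effect, on a much smaller scale, as the accuracy versus G-mean gap reported for IP102.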
3. Materials and Methods
3.1. Image Acquisition
To facilitate further scientific research and practical applications, we sought to address
the problem of limited rice pest species and samples. We therefore compiled a large-scale
dataset, IP_RicePests, for rice pest identification based on the classification system of
IP102. We collected and labeled the dataset in three stages: (1) image collection,
(2) initial image screening, and (3) professional data annotation.
In the image collection stage, we used the IP102 dataset as the main source of rice pest
images and supplemented it with a Python web crawler that automatically collected a large
number of images of 14 rice pests from several specialized agricultural and insect science
websites. In the initial image screening stage, two volunteers manually screened the rice
pest images obtained from the IP102 dataset and the web crawler, removing images that
contained no pests or more than one pest. For example, Figure 1 shows some of the poor
sample images in the IP102 dataset; such images can degrade classification accuracy to
varying degrees and may be one of the main reasons why the highest accuracy is only 49.4%.
In the professional data annotation stage, we invited one expert with specialized knowledge
of rice to annotate each image that passed the initial screening.
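The screening rule described above can be written down as a simple filter. The sketch below is only an illustration: in this work the screening was performed manually by volunteers, and the `Candidate` record, its fields, and the function `initial_screen` are hypothetical names introduced here, not part of the authors' pipeline.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """Hypothetical record for one collected image."""
    path: str                            # location of the image file
    source: str                          # e.g. "IP102" or "crawler"
    pest_count: int                      # number of pests a volunteer sees in the image
    expert_label: Optional[str] = None   # filled in later, at the annotation stage


def initial_screen(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only images showing exactly one pest.

    Images with no pest or with more than one pest are discarded,
    mirroring the rule applied by the volunteers.
    """
    return [c for c in candidates if c.pest_count == 1]


# Toy example with made-up file names.
collected = [
    Candidate("img_001.jpg", "IP102", pest_count=1),
    Candidate("img_002.jpg", "IP102", pest_count=0),    # no pest: removed
    Candidate("img_003.jpg", "crawler", pest_count=3),  # several pests: removed
    Candidate("img_004.jpg", "crawler", pest_count=1),
]

kept = initial_screen(collected)
print([c.path for c in kept])
```

Only the images that pass this filter would then be handed to the expert, who assigns `expert_label` at the annotation stage.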