• Building a model from analyzed data
  • When should it be used?




    Date: 14.07.2023
    When should it be used?
    However, there are situations where stepwise regression is appropriate. For example, if you have a very large number of potential predictors to include in your model, stepwise regression can be used to reduce them. It is usually better to narrow down the variables in your study based on the core literature and theory surrounding the problem and topic under study. But if your research is purely exploratory and there is no existing theoretical basis to guide variable selection, stepwise regression can be used as an exploratory analysis.
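As a rough illustration of reducing a large set of candidate predictors, the sketch below runs forward selection on synthetic data. It assumes scikit-learn is available and uses its `SequentialFeatureSelector`, which adds predictors by cross-validated score rather than by the p-value or AIC criteria of classical stepwise regression. The data, the number of predictors, and the target of 4 selected features are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): 20 candidate predictors,
# of which only 4 actually influence the response.
X, y = make_regression(n_samples=200, n_features=20, n_informative=4,
                       noise=5.0, random_state=0)

# Forward selection: start with no predictors and repeatedly add the one
# that most improves the cross-validated score, until 4 are selected.
selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=4,
    direction="forward",
    cv=5,
)
selector.fit(X, y)

# Boolean mask over the 20 columns marking the retained predictors.
selected_mask = selector.get_support()
selected_columns = [i for i, keep in enumerate(selected_mask) if keep]
print("Selected predictor columns:", selected_columns)
```

The selected subset is best treated as a starting point for exploratory work, not as a final, theory-backed model.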
    CONCLUSION
    In conclusion, stepwise regression is generally not recommended, especially when your research questions are theory-driven. However, if you have a very large number of potential variables for your model, it is worth considering as a way to narrow down your options. This may ultimately lead you to more focused research that does not rely on automatic variable selection.


    Building a model from analyzed data
    Introduction
    In machine learning, classification is the problem of determining which of a set of categories (sub-populations) a new observation belongs to, based on a training dataset containing observations (or instances) whose category membership is known. Some examples of classification problems: (a) determining whether a received message is spam or legitimate email; (b) diagnosing a patient based on the patient's observed characteristics (age, blood pressure, presence or absence of certain symptoms, and so on).

    In this article we use the bank marketing dataset from Kaggle to build a model that predicts whether someone will make a deposit, based on certain attributes. We will try to build 4 different models using the Decision Tree, Random Forest, Naive Bayes, and K-Nearest Neighbours algorithms. After building each model, we will evaluate them and compare which model best fits our case. We will then try to optimize our model by tuning its hyperparameters with GridSearch. Finally, we will save the prediction results for our dataset and then save the model for reuse.
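The workflow described above can be sketched roughly as follows. This is a minimal sketch assuming scikit-learn; it uses a synthetic dataset from `make_classification` as a stand-in for the bank marketing data, `GaussianNB` as one possible Naive Bayes variant, and a small hypothetical hyperparameter grid. None of these choices are taken from the original article.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bank marketing data (an assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# The four algorithm families named in the text.
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Naive Bayes": GaussianNB(),
    "K-Nearest Neighbours": KNeighborsClassifier(),
}

# Fit each model and record its accuracy on the held-out test split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {scores[name]:.3f}")

# Tune one model with a small, hypothetical hyperparameter grid.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [3, 5, None], "min_samples_leaf": [1, 5]},
    cv=5,
)
grid.fit(X_train, y_train)
print("Best parameters:", grid.best_params_)
```

Accuracy is used here only for brevity; other metrics may be more informative when the classes are imbalanced.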



    To begin, we load some basic libraries such as Pandas and NumPy and then apply some configuration to a few of them.
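A setup step of this kind might look like the sketch below. The specific options (showing all columns, three-decimal display precision, a fixed random seed of 42) are assumptions chosen for illustration, not the article's actual configuration.

```python
import numpy as np
import pandas as pd

# Show every column when printing wide DataFrames instead of truncating.
pd.set_option("display.max_columns", None)

# Print floating-point values with three decimal places.
pd.set_option("display.precision", 3)

# Seed NumPy's global random generator so random steps are repeatable.
np.random.seed(42)
```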

    Data Pre-Processing
    Before we can create our first model, we need to load and pre-process the data. This step ensures that our model receives good data to learn from; as the saying goes, “a model is only as good as its data”. The data pre-processing is divided into a few steps, explained below.
    Loading Data
    In this first step we load our dataset, which has been uploaded to my GitHub for easier access. According to the dataset documentation, our data contains the following columns:
    Input variables:

    1. age (numeric)

    2. job : type of job (categorical: ‘admin.’, ‘blue-collar’, ‘entrepreneur’, ‘housemaid’, ‘management’, ‘retired’, ‘self-employed’, ‘services’, ‘student’, ‘technician’, ‘unemployed’, ‘unknown’)

    3. marital : marital status (categorical: ‘divorced’, ‘married’, ‘single’, ‘unknown’; note: ‘divorced’ means divorced or widowed)

    4. education (categorical: ‘basic.4y’, ‘basic.6y’, ‘basic.9y’, ‘high.school’, ‘illiterate’, ‘professional.course’, ‘university.degree’, ‘unknown’)

    5. default: has credit in default? (categorical: ‘no’, ‘yes’, ‘unknown’)

    6. housing: has housing loan? (categorical: ‘no’, ‘yes’, ‘unknown’)

    7. loan: has personal loan? (categorical: ‘no’, ‘yes’, ‘unknown’)

    8. contact: contact communication type (categorical: ‘cellular’, ‘telephone’)

    9. month: last contact month of year (categorical: ‘jan’, ‘feb’, ‘mar’, …, ‘nov’, ‘dec’)

    10. day_of_week: last contact day of the week (categorical: ‘mon’, ‘tue’, ‘wed’, ‘thu’, ‘fri’)

    11. duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=‘no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

    12. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

    13. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)

    14. previous: number of contacts performed before this campaign and for this client (numeric)

    15. poutcome: outcome of the previous marketing campaign (categorical: ‘failure’, ‘nonexistent’, ‘success’)

    Output variable (desired target):

    • y: has the client subscribed a term deposit? (binary: ‘yes’, ‘no’)

    According to the dataset documentation, we need to remove the ‘duration’ column, because in a real case the duration is only known after the label is known. This is a form of ‘data leakage’, where the predictors include data that will not be available at the time you make predictions.
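Loading the data and dropping the leaking column could be sketched as below. Because the actual file location is not given here, the sketch reads a tiny inline CSV whose rows are made up for illustration; only the column names come from the list above.

```python
import io

import pandas as pd

# Tiny inline sample standing in for the real CSV file (an assumption:
# the rows are made up, only the column names follow the documentation).
csv_text = """age,job,marital,duration,campaign,y
30,admin.,married,120,1,no
45,technician,single,300,2,yes
52,retired,divorced,0,1,no
"""

df = pd.read_csv(io.StringIO(csv_text))

# Drop 'duration': it is only known after the call, so keeping it would
# leak information about the label into the predictors.
df = df.drop(columns=["duration"])

print(df.columns.tolist())
```

With the real dataset, `io.StringIO(csv_text)` would be replaced by the path or URL of the CSV file.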




    Class distribution
    Another important thing to check before feeding our data into a model is the class distribution of the data. In our case, where the expected class is split into two outcomes, ‘yes’ and ‘no’, a 50:50 class distribution can be considered ideal.
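One way to inspect the class distribution is to compute the share of each label, as in the sketch below. The label values here are hypothetical (8 ‘no’ and 2 ‘yes’) and serve only to show the calculation; they are not the real dataset's proportions.

```python
import pandas as pd

# Hypothetical target column: 8 'no' and 2 'yes' labels (an assumption).
y = pd.Series(["no"] * 8 + ["yes"] * 2, name="y")

# Share of each class; an ideal binary split would be 0.5 / 0.5.
class_share = y.value_counts(normalize=True)
print(class_share)

# Ratio of the majority class to the minority class (1.0 means balanced).
imbalance_ratio = class_share.max() / class_share.min()
print("Imbalance ratio:", imbalance_ratio)
```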

