This report summarizes the quality and characteristics of the Blended Malware Image Dataset at the image-metadata level. The analysis covers class balance, consistency between the train and validation splits, file size, and byteplot visual statistics, all examined before model training.


| Metric | Value | Interpretation |
|---|---|---|
| Total images | 13,747 | Large enough for multi-model deep learning experiments |
| Total classes | 31 | A fairly challenging multi-class classification task |
| Train/Val split | 9,868 / 3,879 | The split is predefined (about 72% / 28%), which supports consistent evaluation |
| Largest class size | 501 | Largest class: Expiro |
| Smallest class size | 177 | Smallest class: Dialplatform.B |
| Imbalance ratio | 2.83x | Largest class size divided by smallest; the closer to 1, the more balanced the class distribution |
| Train vs val mean intensity gap | 2.04 | A small gap indicates that the two splits have similar visual distributions |
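
The imbalance ratio in the table above can be reproduced from the two class sizes it reports. The following is a minimal sketch, assuming the ratio is defined as the largest class size divided by the smallest (which matches the reported 2.83x); the function and variable names are illustrative and not taken from the original analysis.

```python
# Class sizes taken from the summary table (largest and smallest family).
largest_class_size = 501   # Expiro
smallest_class_size = 177  # Dialplatform.B


def imbalance_ratio(largest: int, smallest: int) -> float:
    """Return largest / smallest; 1.0 means perfectly balanced classes."""
    if smallest <= 0:
        raise ValueError("smallest class size must be positive")
    return largest / smallest


ratio = imbalance_ratio(largest_class_size, smallest_class_size)
print(f"Imbalance ratio: {ratio:.2f}x")
```
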

| Insight | Finding | Implication |
|---|---|---|
| Class balance | Imbalance ratio = 2.83x (moderately imbalanced) | Most families have close to 500 images, so the risk of bias toward majority classes is limited. The six families with fewer than 400 images still need attention when reading per-class F1, especially the four with only 25 validation images each. |
| Split consistency | Mean intensity differs by 2.04 between train and val (highly consistent across splits) | The validation set is reasonably representative of the training set in terms of pixel intensity, which makes the generalization estimate more trustworthy. |
| Image format | Most images are 300×300 (the 25th, 50th, and 75th percentiles of width and height are all 300); the mean size is about 298.5×308.5 | Resizing to 128×128 remains reasonable because the visual structure of byteplots is relatively consistent across samples. |
| Task complexity | 31 malware families are classified at once | The problem is challenging enough for a baseline CNN and suitable for demonstrating the added value of transfer learning. |

| split | num_images | num_classes | mean_width | mean_height | mean_file_size_kb | mean_intensity |
|---|---|---|---|---|---|---|
| train | 9868 | 31 | 298.842 | 309.987 | 205.270 | 99.944 |
| val | 3879 | 31 | 297.714 | 304.801 | 205.742 | 97.899 |
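
To put the train/val intensity gap in context, the sketch below recomputes it from the rounded values in the split summary and compares it with the spread of `mean_intensity` across all images (the `std` reported in the descriptive statistics table of this report). Dividing the gap by that standard deviation is an assumption of this sketch, not a metric used in the original analysis, and the names are illustrative.

```python
# Values copied from the split summary and descriptive statistics tables.
split_summary = {
    "train": {"num_images": 9868, "mean_intensity": 99.944},
    "val": {"num_images": 3879, "mean_intensity": 97.899},
}
overall_intensity_std = 28.1164  # std of mean_intensity across all images


def intensity_gap(summary: dict) -> float:
    """Absolute difference between train and val mean intensity."""
    return abs(summary["train"]["mean_intensity"] - summary["val"]["mean_intensity"])


gap = intensity_gap(split_summary)

# Gap in units of the per-image spread (an assumption of this sketch).
standardized_gap = gap / overall_intensity_std

total_images = sum(s["num_images"] for s in split_summary.values())
val_share = split_summary["val"]["num_images"] / total_images

print(f"gap={gap:.3f}, standardized={standardized_gap:.3f}, val share={val_share:.1%}")
```

With these inputs the gap is under a tenth of one standard deviation, which is consistent with the report's reading that the two splits look alike in overall brightness.
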

Descriptive statistics for image dimensions, file size, and pixel intensity characteristics.

| feature | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| width | 13747.0 | 298.5233 | 65.8955 | 64.0000 | 300.0000 | 300.0000 | 300.0000 | 768.0000 |
| height | 13747.0 | 308.5237 | 67.3313 | 216.0000 | 300.0000 | 300.0000 | 300.0000 | 999.0000 |
| aspect_ratio | 13747.0 | 0.9763 | 0.1652 | 0.1356 | 1.0000 | 1.0000 | 1.0000 | 1.4436 |
| file_size_kb | 13747.0 | 205.4033 | 72.9727 | 4.5100 | 181.2570 | 230.3050 | 250.2600 | 590.2750 |
| mean_intensity | 13747.0 | 99.3673 | 28.1164 | 1.2561 | 86.9565 | 102.4814 | 114.7210 | 226.9649 |
| std_intensity | 13747.0 | 54.3004 | 16.5963 | 4.0804 | 43.5028 | 53.3047 | 64.0338 | 93.9794 |
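
The `mean_intensity` and `std_intensity` features are per-image summaries of grayscale pixel values. The sketch below shows one plausible way to compute them for a single image using only the standard library. The tiny `toy_image` is made up for illustration, and the use of the population standard deviation is an assumption, since the report does not say which estimator was used.

```python
from statistics import fmean, pstdev


def intensity_stats(pixels: list[list[int]]) -> tuple[float, float]:
    """Return (mean, population std) over all grayscale pixel values."""
    flat = [value for row in pixels for value in row]
    if not flat:
        raise ValueError("image has no pixels")
    return fmean(flat), pstdev(flat)


# Hypothetical 2x3 grayscale image with 8-bit values (not from the dataset).
toy_image = [
    [0, 64, 128],
    [64, 128, 255],
]

mean_intensity, std_intensity = intensity_stats(toy_image)
print(f"mean_intensity={mean_intensity:.2f}, std_intensity={std_intensity:.2f}")
```
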

A horizontal chart layout is used so that every family name remains legible rather than being truncated, as it would be on a vertical chart.

| class_name | total_images |
|---|---|
| Expiro | 501 |
| Elex | 500 |
| Androm | 500 |
| Fasong | 500 |
| Neoreklami | 500 |
| InstallCore | 500 |
| Hlux | 500 |
| Snarasite | 500 |
| Stantinko | 500 |
| VBA | 500 |
| MultiPlug | 499 |
| HackKMS | 499 |
| Dinwod | 499 |
| Sality | 499 |
| Amonetize | 497 |
| Neshta | 497 |
| Autorun | 496 |
| Vilsel | 496 |
| VBKrypt | 496 |
| Injector | 495 |
| Adposhel | 494 |
| BrowseFox | 493 |
| Regrun | 485 |
| Allaple | 478 |
| Agent | 470 |
| Fakerean | 381 |
| Lolyda.AA1 | 213 |
| C2LOP.gen!g | 200 |
| Alueron.gen!J | 198 |
| Lolyda.AA2 | 184 |
| Dialplatform.B | 177 |

This chart shows whether the train and validation splits are roughly proportional for each malware family.

This heatmap helps reveal the relationships between image dimensions, file size, and intensity statistics.

This distribution is useful for checking whether the train and validation splits come from a visually similar population.

| split | class_name | image_count |
|---|---|---|
| train | Adposhel | 350 |
| train | Agent | 350 |
| train | Allaple | 350 |
| train | Alueron.gen!J | 173 |
| train | Amonetize | 350 |
| train | Androm | 350 |
| train | Autorun | 350 |
| train | BrowseFox | 350 |
| train | C2LOP.gen!g | 175 |
| train | Dialplatform.B | 152 |
| train | Dinwod | 350 |
| train | Elex | 350 |
| train | Expiro | 350 |
| train | Fakerean | 306 |
| train | Fasong | 350 |
| train | HackKMS | 350 |
| train | Hlux | 350 |
| train | Injector | 350 |
| train | InstallCore | 350 |
| train | Lolyda.AA1 | 153 |
| train | Lolyda.AA2 | 159 |
| train | MultiPlug | 350 |
| train | Neoreklami | 350 |
| train | Neshta | 350 |
| train | Regrun | 350 |
| train | Sality | 350 |
| train | Snarasite | 350 |
| train | Stantinko | 350 |
| train | VBA | 350 |
| train | VBKrypt | 350 |
| train | Vilsel | 350 |
| val | Adposhel | 144 |
| val | Agent | 120 |
| val | Allaple | 128 |
| val | Alueron.gen!J | 25 |
| val | Amonetize | 147 |
| val | Androm | 150 |
| val | Autorun | 146 |
| val | BrowseFox | 143 |
| val | C2LOP.gen!g | 25 |
| val | Dialplatform.B | 25 |
| val | Dinwod | 149 |
| val | Elex | 150 |
| val | Expiro | 151 |
| val | Fakerean | 75 |
| val | Fasong | 150 |
| val | HackKMS | 149 |
| val | Hlux | 150 |
| val | Injector | 145 |
| val | InstallCore | 150 |
| val | Lolyda.AA1 | 60 |
| val | Lolyda.AA2 | 25 |
| val | MultiPlug | 149 |
| val | Neoreklami | 150 |
| val | Neshta | 147 |
| val | Regrun | 135 |
| val | Sality | 149 |
| val | Snarasite | 150 |
| val | Stantinko | 150 |
| val | VBA | 150 |
| val | VBKrypt | 146 |
| val | Vilsel | 146 |
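
Whether the split is proportional per family can also be checked directly from the counts above. The sketch below computes the validation share for a subset of families copied from the table and compares each with the overall share (3,879 of 13,747 images). The subset, the function name, and the comparison against the overall share are choices made for this illustration.

```python
# Subset of the per-split counts table: (train, val) images per family.
split_counts = {
    "Expiro": (350, 151),
    "Agent": (350, 120),
    "Fakerean": (306, 75),
    "Lolyda.AA1": (153, 60),
    "Lolyda.AA2": (159, 25),
    "C2LOP.gen!g": (175, 25),
    "Alueron.gen!J": (173, 25),
    "Dialplatform.B": (152, 25),
}
overall_val_share = 3879 / 13747  # validation images / total images


def validation_share(train: int, val: int) -> float:
    """Fraction of a family's images that fall in the validation split."""
    return val / (train + val)


val_share = {
    name: validation_share(train, val)
    for name, (train, val) in split_counts.items()
}

# List families from the smallest validation share to the largest.
for name, share in sorted(val_share.items(), key=lambda item: item[1]):
    flag = "below overall" if share < overall_val_share else "at or above overall"
    print(f"{name:<15} {share:6.1%}  ({flag})")
```

In this subset, the four families with only 25 validation images have a validation share of roughly 12% to 14%, compared with about 28% overall, so their per-class validation metrics rest on far fewer samples than those of the larger families.
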