Q1 ML for Malware Analysis 25 Points
In the Week 7 lecture, we showed example code that trains three machine learning models and measures their performance on feature vectors extracted from 200 binaries, of which 50% are malware and 50% are benignware. You can find the code and the data at https://drive.google.com/drive/folders/142NMRSTifttezfPqwTkf6dlg-VWOrdaY?usp=drive_link
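For reference, the workflow looks roughly like the sketch below. This is not the lecture notebook itself: the three classifier choices and the synthetic stand-in data (generated in place of test.csv, whose last column holds the malware/benign label) are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Synthetic stand-in for test.csv: 200 samples, 23 features, balanced labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 23))
y = np.array([1] * 100 + [0] * 100)   # 1 = malware, 0 = benignware
X[y == 1, :3] += 2.0                  # make a few features informative

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Three common classifiers stand in for the lecture's models (an assumption).
models = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
}
results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)                      # train on the training split
    pred = model.predict(X_te)                 # evaluate on held-out data
    results[name] = (accuracy_score(y_te, pred),
                     precision_score(y_te, pred),
                     recall_score(y_te, pred))
    print(name, results[name])
```

Swapping the synthetic arrays for the columns of test.csv reproduces the same train/measure loop on the real data.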
Download the Jupyter notebook and the test.csv file from the above link, and run the code either on Google Colab or using an Anaconda installation on your own machine.
(i) Open the test.csv file using Excel or another spreadsheet program. You will find that rows 2 through 101 are labeled as malware (see the last column) and rows 102 through 201 are labeled as benignware. Because the dataset is so small, you can eyeball it quickly and spot certain features (columns) whose values differ sharply between rows marked as malware and rows marked as benign. Those features are useful for classifying malware versus benignware. You will also find some features whose values look similar regardless of the row's label; such features are useless for classification. Name 3 features that are useful for classification and 3 that are not.
(ii) Explain the need for feature selection in 3-4 sentences. In other words, once we have extracted the features, why not use all of them, and why do we need to select a subset?
(iii) In the code, you will find that we computed feature correlations and generated a heatmap of all pairwise correlations. However, we selected only those features that have a high absolute correlation with the labels. Explain in your own words (2-3 sentences) why this selection criterion makes sense.
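To make the criterion concrete, here is a minimal numpy sketch of selecting features by absolute Pearson correlation with the label vector. The threshold value and the synthetic data are illustrative assumptions, not the notebook's actual settings.

```python
import numpy as np

def select_by_label_correlation(X, y, threshold=0.5):
    """Keep features whose |Pearson correlation| with the label exceeds threshold."""
    corrs = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    keep = np.abs(corrs) > threshold
    return X[:, keep], np.flatnonzero(keep)

# Toy data: only feature 0 tracks the label; the rest are pure noise.
rng = np.random.default_rng(1)
y = np.array([1] * 100 + [0] * 100)
X = rng.normal(size=(200, 5))
X[:, 0] += 3 * y

X_sel, idx = select_by_label_correlation(X, y, threshold=0.5)
print(idx)   # only the informative feature survives the filter
```

Running the same filter on the assignment's 23 extracted features is how a subset like the lecture's 7 selected features can be obtained.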
(iv) In the code, we selected 7 of the 23 extracted features. State in your own words (no more than 2-3 sentences) why, even after removing that many features, some of the machine learning models still achieved high accuracy, precision, and recall.
(v) We kept only those features that have a high correlation with the labels, but there are other ways to reduce the feature set -- describe in 2-3 sentences one possible alternative method for feature selection.
Q2 ML for Intrusion Detection 25 Points
In the Week 8 lecture, we showed how to use ML models trained on network packet data for intrusion detection. You can find the data and the code at https://drive.google.com/drive/folders/1BrX2QtYvTZiBIKYVrn64phV4dbqsDDpn?usp=sharing
(i) In the example code shown in Week 8, we showed how the scapy library is used. Describe in your own words what use of the scapy library was shown. (Hint: the rest of the code uses a pcap file as the data source and does not use the scapy library -- think about how the pcap files might have been collected.)
(ii) Explain in your own words what the flows constructed from the packets in a pcap file are.
(iii) In the example code shown in Week 8, explain in 2-3 sentences how the flows are labeled as benign or malicious.
(iv) In the example code shown in Week 8, we use PCA to transform the feature vectors, and then plot the first two components of the transformed vectors to visualize the data in 2-D. Do your own research to find out what PCA does, and explain in 2-3 sentences why it is useful.
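As background for the transform step (not the notebook's code), the following numpy sketch computes a PCA projection via SVD; the two-cluster synthetic data is an assumption made so the 2-D projection has visible structure.

```python
import numpy as np

def pca_transform(X, n_components=2):
    """Minimal PCA sketch via SVD: project data onto its top principal components."""
    Xc = X - X.mean(axis=0)                        # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S**2 / np.sum(S**2)                # variance ratio per component
    return Xc @ Vt[:n_components].T, explained[:n_components]

# Toy data: two clusters separated in 10-D, invisible in any single raw feature pair.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 10))
X[:150] += 4.0

X2, ratio = pca_transform(X, n_components=2)
print(X2.shape)    # (300, 2): ready for a 2-D scatter plot
print(ratio)       # first component captures most of the variance
```

Plotting the two columns of `X2` against each other is exactly the 2-D visualization step the lecture code performs on the flow features.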