Examining Ethical Issues with Malware and Designing a Browser-Based Phishing Identifier using Deep Learning
This is the finalized title of our project. It's a comprehensive amalgamation of my personal interests - tech - with new areas I'm unfamiliar with - humanities. This is part two of two posts, where I will address our research process.
I have previously worked with network security and cryptography, taking a summer course and, later, a more math-based course on the subject, so I'm familiar with malware and its technical aspects. From Diffie-Hellman to ElGamal, the number theory behind cryptography has long intrigued me, but I was stunned when I learned that the vast majority of successful malware attacks come from social engineering. (This can be seen in both a positive and a negative light - yes, firewalls are working, and our computers keep out intruders. But that means the attacks that succeed are the ones aimed at us, the users.) Social engineering, as the name indicates, is a type of attack that relies heavily on human interaction, trying to trick people into letting malware in. Phishing emails are one example (think of what spam filters try to catch), and people can learn to avoid infecting their computers by taking courses and learning how to identify potential attacks. I became interested in learning about this other side of network security - the human side.
As we've mentioned before in this blog, James and I met at a summer research program, where we worked together in a machine learning lab (specifically, computer vision). It was there that I was first intrigued by the beauty of artificial intelligence. The term used to conjure complex, even intimidating, images of thousands of lines of code and huge, clunky GPUs. While the latter is certainly true - I used my GPU over the summer as a footrest - the charm of AI and machine learning comes, in my opinion, from its simplicity and its similarity to the way humans learn, which is most often by trial and error. Just as we learn from our mistakes, machine learning teaches computers to become accurate by adjusting their parameters as they measure their error.
So if people could be taught how to avoid social-engineering-based malware, could computers be taught this as well? After all, both kinds of learning are rooted in trial and error. To research this connection further, we decided to look at trends in social engineering, and two in particular: word frequencies and image-to-word-count ratios. After looking through the literature, we found several papers addressing the most common words found in phishing emails, and several others discussing how social engineers coerce people into giving up their most valuable information. We also found PhishSim, offered through SecurityIQ at InfoSec Institute. It held a wide range of phishing email templates, which we stripped down to uniform plain text and ran through a word frequency program.
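To make the two features concrete, here is a minimal Python sketch of how they could be computed from a template that has already been stripped to plain text. This is not the program we actually ran; the function names, the tokenizing rule, and the sample sentence are all assumptions made up for illustration.

```python
import re
from collections import Counter

# Hypothetical tokenizer: lowercase the text and keep runs of letters/apostrophes.
WORD_PATTERN = re.compile(r"[a-z']+")


def tokenize(text):
    """Split plain email text into lowercase words."""
    return WORD_PATTERN.findall(text.lower())


def word_frequencies(text):
    """Count how many times each word appears in the text."""
    return Counter(tokenize(text))


def image_to_word_ratio(num_images, text):
    """Number of images divided by number of words (0.0 for empty text)."""
    num_words = len(tokenize(text))
    return num_images / num_words if num_words else 0.0


# Made-up sample text, not taken from any real template.
sample = "Free money! Claim your free prize now."
freqs = word_frequencies(sample)
ratio = image_to_word_ratio(0, sample)
```

Calling `freqs.most_common(5)` would list the most frequent words in a template, which is the kind of summary the word frequency step is after.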
While creating our templates, we began to look into which model to use. We whittled the possibilities down to two. The first was a Naive Bayes classifier, which assumes that, within a class, the presence of a particular feature is unrelated to the presence of any other feature. However, we realized that some words tend to appear together, like "free" and "money", but not "free" and "shipment". Our templates were also separated by type, an important variable that Naive Bayes would not consider. The second was logistic regression, in which the outcome is measured with a dichotomous variable (only two possible outcomes). The goal is to find the best fit to describe the relationship between the input variables and that outcome. Our two outcomes would be phishing or not phishing, and we could input word frequencies and the image-to-word-count ratio as characteristics of interest, which, combined, would teach a computer when to output which. We decided that logistic regression would be the best fit for our identifier.
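As a rough illustration of the idea, here is a small logistic regression written from scratch in Python. It is only a sketch under stated assumptions, not our identifier: the three features (frequency of "free", frequency of "money", image-to-word-count ratio), the four training rows, the learning rate, and the number of epochs are all invented for the example.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1 (stable form)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, features):
    """Estimated probability that an email with these features is phishing."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def train(samples, labels, lr=0.5, epochs=2000):
    """Fit weights by stochastic gradient descent on the log loss."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            # Error is the gap between the prediction and the true label.
            error = predict_proba(weights, bias, x) - y
            # Nudge each parameter in the direction that shrinks the error.
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


# Invented toy data: [freq of "free", freq of "money", image:word ratio].
samples = [
    [0.10, 0.08, 0.00],  # labeled phishing
    [0.07, 0.05, 0.00],  # labeled phishing
    [0.00, 0.00, 0.20],  # labeled not phishing
    [0.01, 0.00, 0.15],  # labeled not phishing
]
labels = [1, 1, 0, 0]

weights, bias = train(samples, labels)
p_suspicious = predict_proba(weights, bias, [0.09, 0.07, 0.00])
p_ordinary = predict_proba(weights, bias, [0.00, 0.00, 0.18])
```

An email would then be flagged as phishing when its predicted probability is above 0.5. The update inside `train` is the "adjust the parameters as you measure the error" loop in its simplest form.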
Future steps are outlined in our paper - creating a data set based on the templates and then programming the actual logistic regression. The finished product would be our Browser-Based Phishing Identifier using Deep Learning.
Creating this identifier has been incredible to me because of the way it brings my interests together. It bridges computer science, cryptography, and artificial intelligence, and there is also an element of the humanities. Learning about new types of models (we used convolutional neural networks for computer vision) was a nostalgic callback to what I did over the summer, but also a strong step in continuing to learn about machine learning. I also learned to consider the human aspects, especially when creating this type of identifier. I had to learn to think like the user - what would set off alarms in a phishing email? For example, the image-to-word-count ratio was an idea of mine that I didn't see in other papers, but for me, seeing a marketing email that advertises a product without any product images would be a huge red flag. Looking through promotional emails in particular, most don't even include more than 10 words of text. Considering this human part was something I enjoyed as well, and I look forward to more projects in the humanities.
See the previous post.
-Sohini