HolisticInfoSec™: R

Showing posts with label R. Show all posts

Sunday, June 03, 2018

toolsmith #133 - Anomaly Detection & Threat Hunting with Anomalize

When, in October and November's toolsmith posts, I redefined DFIR under the premise of Deeper Functionality for Investigators in R, I discovered a "tip of the iceberg" scenario. To that end, I'd like to revisit the concept with an additional discovery and opportunity. In reality, this is really a case of DFIR (Deeper Functionality for Investigators in R) within the general practice of the original and paramount DFIR (Digital Forensics/Incident Response).
As discussed here before, those of us in the DFIR practice, and Blue Teaming at large, are overwhelmed by data and scale. Success truly requires algorithmic methods. If you're not already invested here I have an immediately applicable case study for you in tidy anomaly detection with anomalize.
First, let me give credit where entirely due for the work that follows. Everything I discuss and provide is immediately derivative from Business Science (@bizScienc), specifically Matt Dancho (@mdancho84). He created anomalize, "a tidy anomaly detection algorithm that’s time-based (built on top of tibbletime) and scalable from one to many time series," when a client asked Business Science to build an open source anomaly detection algorithm that suited their needs. I'd say he responded beautifully, when his blogpost hit my radar via R-Bloggers it lived as an open tab in my browser for more than a month until generating this toolsmith. Please consider Matt's post a mandatory read as step one of the process here. I'll quote Matt specifically before shifting context: "Our client had a challenging problem: detecting anomalies in time series on daily or weekly data at scale. Anomalies indicate exceptional events, which could be increased web traffic in the marketing domain or a malfunctioning server in the IT domain. Regardless, it’s important to flag these unusual occurrences to ensure the business is running smoothly. One of the challenges was that the client deals with not one time series but thousands that need to be analyzed for these extreme events."
Key takeaway: Detecting anomalies in time series on daily or weekly data at scale. Anomalies indicate exceptional events.
Now shift context with me to security-specific events and incidents, as the pertain to security monitoring, incident response, and threat hunting. In my November 2017 post, recall that I discussed Time Series Regression with the Holt-Winters method and a focus on seasonality and trends. Unfortunately, I couldn't share the code for how we applied TSR, but pointed out alternate methods, including Seasonal and Trend Decomposition using Loess (STL):

Handles any type of seasonality ~ can change over time
Smoothness of the trend-cycle can also be controlled by the user
Robust to outliers

Here now, Matt has created a means to immediately apply the STL method, along with the Twitter method (reference page), as part of his time_decompose() function, one of three functions specific to the anomalize package. In addition to time_decompose(), which separates the time series into seasonal, trend, and remainder components, anomalize includes:

anomalize(): Applies anomaly detection methods to the remainder component.
time_recompose(): Calculates limits that separate the “normal” data from the anomalies

The methods used in anomalize(), including IQR and GESD are described in Matt's reference page. Matt ultimately set out to build a scalable adaptation of Twitter's AnomalyDetection package in order to address his client's challenges in dealing with not one time series but thousands needing to be analyzed for extreme events. You'll note that Matt describes anomalize using a dataset of the daily download counts of the 15 tidyverse packages from CRAN, relevant as he leverages the tidyverse package. I initially toyed with tweaking Matt's demo to model downloads for security-specific R packages (yes, there are such things) from CRAN, including RAppArmor, net.security, securitytxt, and cymruservices, the latter two courtesy of Bob Rudis (@hrbrmstr) of our beloved Data-Driven Security: Analysis, Visualization and Dashboards. Alas, this was a mere rip and replace, and really didn't exhibit the use of anomalize in a deserving, varied, truly security-specific context. That said, I was able to generate immediate results doing so, as seen in Figure 1.

Figure 1: Initial experiment

As an initial experiment you can replace packages names with those of your choosing in tidyverse_cran_downloads.R, run it in R Studio, then tweak variable names and labels in the code per Matt's README page.

I wanted to run anomalize against a real security data scenario, so I went back to the dataset from the original DFIR articles where I'd utilized counts of 4624 Event IDs per day, per user, on a given set of servers. As utilized originally, I'd represented results specific to only one device and user, but herein is the beauty of anomalize. We can achieve quick results across multiple times series (multiple systems/users). This premise is but one of many where time series analysis and seasonality can be applied to security data.
I originally tried to write log data from log.csv straight to an anomalize.R script with logs = read_csv("log.csv") into a tibble (ready your troubles with tibbles jokes), which was not being parsed accurately, particularly time attributes. To correct this, from Matt's Github I grabbed tidyverse_cran_downloads.R, and modified it as follows:
This helped greatly thanks to the tibbletime package, which is "is an extension that allows for the creation of time aware tibbles. Some immediate advantages of this include: the ability to perform time based subsetting on tibbles, quickly summarising and aggregating results by time periods. Guess what, Matt wrote tibbletime too. :-)
I then followed Matt's sequence as he posted on Business Science, but with my logs defined as a function in Security_Access_Logs_Function.R. Following, I'll give you the code snippets, as revised from Matt's examples, followed by their respective results specific to processing my Event ID 4624 daily count log.
First, let's summarize daily login counts across three servers over four months.
The result is evident in Figure 2.

Figure 2: Server logon counts visualized

Next, let's determine which daily download logons are anomalous with Matt's three main functions, time_decompose(), anomalize(), and time_recompose(), along with the visualization function, plot_anomalies(), across the same three servers over four months.
The result is revealed in Figure 3.

Figure 3: Security event log anomalies

Following Matt's method using Twitter’s AnomalyDetection package, combining time_decompose(method = "twitter") with anomalize(method = "gesd"), while adjusting the trend = "4 months" to adjust median spans, we'll focus only on SERVER-549521.
In Figure 4, you'll note that there are anomalous logon counts on SERVER-549521 in June.

Figure 4: SERVER-549521 logon anomalies with Twitter & GESD methods

We can compare the Twitter (time_decompose) and GESD (anomalize) methods with the STL (time_decompose) and IQR (anomalize) methods, which use different decomposition and anomaly detection approaches.
Again, we note anomalies in June, as seen in Figure 5.

Figure 5: SERVER-549521 logon anomalies with STL & IQR methods

Obviously, the results are quite similar, as one would hope. Finally, let use Matt's plot_anomaly_decomposition() for visualizing the inner workings of how algorithm detects anomalies in the remainder for SERVER-549521.
The result is a four part visualization, including observed, season, trend, and remainder as seen in Figure 6.

Figure 6: Decomposition for SERVER-549521 Logins

I'm really looking forward to putting these methods to use at a much larger scale, across a far broader event log dataset. I firmly assert that blue teams are already way behind in combating automated adversary tactics and problems of sheer scale, so...much...data. It's only with tactics such as Matt's anomalize, and others of its ilk, that defenders can hope to succeed. Be sure the watch Matt's YouTube video on anomalize, Business Science is building a series of videos in addition, so keep an eye out there and on their GitHub for more great work that we can apply a blue team/defender's context to.

All the code snippets are in my GitHubGist here, and the sample log file, a single R script, and a Jupyter Notebook are all available for you on my GitHub under toolsmith_r. I hope you find anomalize as exciting and useful as I have, great work by Matt, looking forward to see what's next from Business Science.

Cheers...until next time.

Sunday, November 19, 2017

toolsmith #129 - DFIR Redefined: Deeper Functionality for Investigators with R - Part 2

You can have data without information, but you cannot have information without data. ~Daniel Keys Moran

Here we resume our discussion of DFIR Redefined: Deeper Functionality for Investigators with R as begun in Part 1.
First, now that my presentation season has wrapped up, I've posted the related material on the Github for this content. I've specifically posted the most recent version as presented at SecureWorld Seattle, which included Eric Kapfhammer's contributions and a bit of his forward thinking for next steps in this approach.
When we left off last month I parted company with you in the middle of an explanation of analysis of emotional valence, or the "the intrinsic attractiveness (positive valence) or averseness (negative valence) of an event, object, or situation", using R and the Twitter API. It's probably worth your time to go back and refresh with the end of Part 1. Our last discussion point was specific to the popularity of negative tweets versus positive tweets with a cluster of emotionally neutral retweets, two positive retweets, and a load of negative retweets. This type of analysis can quickly give us better understanding of an attacker collective's sentiment, particularly where the collective is vocal via social media. Teeing off the popularity of negative versus positive sentiment, we can assess the actual words fueling such sentiment analysis. It doesn't take us much R code to achieve our goal using the apply family of functions. The likes of apply, lapply, and sapply allow you to manipulate slices of data from matrices, arrays, lists and data frames in a repetitive way without having to use loops. We use code here directly from Michael Levy, Social Scientist, and his Playing with Twitter Data post.

polWordTables =
sapply(pol, function(p) {
words = c(positiveWords = paste(p[[1]]$pos.words[[1]], collapse = ' '),
negativeWords = paste(p[[1]]$neg.words[[1]], collapse = ' '))
gsub('-', '', words) # Get rid of nothing found's "-"
}) %>%
apply(1, paste, collapse = ' ') %>%
stripWhitespace() %>%
strsplit(' ') %>%
sapply(table)

par(mfrow = c(1, 2))
invisible(
lapply(1:2, function(i) {
dotchart(sort(polWordTables[[i]]), cex = .5)
mtext(names(polWordTables)[i])
}))

The result is a tidy visual representation of exactly what we learned at the end of Part 1, results as noted in Figure 1.

Figure 1: Positive vs negative words

Content including words such as killed, dangerous, infected, and attacks are definitely more interesting to readers than words such as good and clean. Sentiment like this could definitely be used to assess potential attacker outcomes and behaviors just prior, or in the midst of an attack, particularly in DDoS scenarios. Couple sentiment analysis with the ability to visualize networks of retweets and mentions, and you could zoom in on potential leaders or organizers. The larger the network node, the more retweets, as seen in Figure 2.

Figure 2: Who is retweeting who?

Remember our initial premise, as described in Part 1, was that attacker groups often use associated hashtags and handles, and the minions that want to be "part of" often retweet and use the hashtag(s). Individual attackers either freely give themselves away, or often become easily identifiable or associated, via Twitter. Note that our dominant retweets are for @joe4security, @HackRead, @defendmalware (not actual attackers, but bloggers talking about attacks, used here for example's sake). Figure 3 shows us who is mentioning who.

Figure 3: Who is mentioning who?

Note that @defendmalware mentions @HackRead. If these were actual attackers it would not be unreasonable to imagine a possible relationship between Twitter accounts that are actively retweeting and mentioning each other before or during an attack. Now let's assume @HackRead might be a possible suspect and you'd like to learn a bit more about possible additional suspects. In reality @HackRead HQ is in Milan, Italy. Perhaps Milan then might be a location for other attackers. I can feed in Twittter handles from my retweet and mentions network above, query the Twitter API with very specific geocode, and lock it within five miles of the center of Milan.
The results are immediate per Figure 4.

Figure 4: GeoLocation code and results

Obviously, as these Twitter accounts aren't actual attackers, their retweets aren't actually pertinent to our presumed attack scenario, but they definitely retweeted @computerweekly (seen in retweets and mentions) from within five miles of the center of Milan. If @HackRead were the leader of an organization, and we believed that associates were assumed to be within geographical proximity, geolocation via the Twitter API could be quite useful. Again, these are all used as thematic examples, no actual attacks should be related to any of these accounts in any way.

Fast Frugal Trees (decision trees) for prioritizing criticality

With the abundance of data, and often subjective or biased analysis, there are occasions where a quick, authoritative decision can be quite beneficial. Fast-and-frugal trees (FFTs) to the rescue. FFTs are simple algorithms that facilitate efficient and accurate decisions based on limited information.
Nathaniel D. Phillips, PhD created FFTrees for R to allow anyone to easily create, visualize and evaluate FFTs. Malcolm Gladwell has said that "we are suspicious of rapid cognition. We live in a world that assumes that the quality of a decision is directly related to the time and effort that went into making it.” FFTs, and decision trees at large, counter that premise and aid in the timely, efficient processing of data with the intent of a quick but sound decision. As with so much of information security, there is often a direct correlation with medical, psychological, and social sciences, and the use of FFTs is no different. Often, predictive analysis is conducted with logistic regression, used to "describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables." Would you prefer logistic regression or FFTs?

Figure 5: Thanks, I'll take FFTs

Here's a text book information security scenario, often rife with subjectivity and bias. After a breach, and subsequent third party risk assessment that generated a ton of CVSS data, make a fast decision about what treatments to apply first. Because everyone loves CVSS.

Figure 6: CVSS meh

Nothing like a massive table, scored by base, impact, exploitability, temporal, environmental, modified impact, and overall scores, all assessed by a third party assessor who may not fully understand the complexities or nuances of your environment. Let's say our esteemed assessor has decided that there are 683 total findings, of which 444 are non-critical and 239 are critical. Will FFTrees agree? Nay! First, a wee bit of R code.

library("FFTrees")
cvss <- c:="" coding="" csv="" p="" r="" read.csv="" rees="">cvss.fft <- data="cvss)</p" fftrees="" formula="critical">plot(cvss.fft, what = "cues")
plot(cvss.fft,
main = "CVSS FFT",
decision.names = c("Non-Critical", "Critical"))

Guess what, the model landed right on impact and exploitability as the most important inputs, and not just because it's logically so, but because of their position when assessed for where they fall in the area under the curve (AUC), where the specific curve is the receiver operating characteristic (ROC). The ROC is a "graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied." As for the AUC, accuracy is measured by the area under the ROC curve where an area of 1 represents a perfect test and an area of .5 represents a worthless test. Simply, the closer to 1, the better. For this model and data, impact and exploitability are the most accurate as seen in Figure 7.

Figure 7: Cue rankings prefer impact and exploitability

The fast and frugal tree made its decision where impact and exploitability with scores equal or less than 2 were non-critical and exploitability greater than 2 was labeled critical, as seen in Figure 8.

Figure 8: The FFT decides

Ah hah! Our FFT sees things differently than our assessor. With a 93% average for performance fitting (this is good), our tree, making decisions on impact and exploitability, decides that there are 444 non-critical findings and 222 critical findings, a 17 point differential from our assessor. Can we all agree that mitigating and remediating critical findings can be an expensive proposition? If you, with just a modicum of data science, can make an authoritative decision that saves you time and money without adversely impacting your security posture, would you count it as a win? Yes, that was rhetorical.

Note that the FFTrees function automatically builds several versions of the same general tree that make different error trade-offs with variations in performance fitting and false positives. This gives you the option to test variables and make potentially even more informed decisions within the construct of one model. Ultimately, fast frugal trees make very fast decisions on 1 to 5 pieces of information and ignore all other information. In other words, "FFTrees are noncompensatory, once they make a decision based on a few pieces of information, no additional information changes the decision."

Finally, let's take a look at monitoring user logon anomalies in high volume environments with Time Series Regression (TSR). Much of this work comes courtesy of Eric Kapfhammer, our lead data scientist on our Microsoft Windows and Devices Group Blue Team. The ideal Windows Event ID for such activity is clearly 4624: an account was successfully logged on. This event is typically one of the top 5 events in terms of volume in most environments, and has multiple type codes including Network, Service, and RemoteInteractive.

User accounts will begin to show patterns over time, in aggregate, including:

Seasonality: day of week, patch cycles,
Trend: volume of logons increasing/decreasing over time
Noise: randomness

You could look at 4624 with a Z-score model, which sets a threshold based on the number of standard deviations away from an average count over a given period of time, but this is a fairly simple model. The higher the value, the greater the degree of “anomalousness”.

Preferably, via Time Series Regression (TSR), your feature set is more rich:

Statistical method for predicting a future response based on the response history (known as autoregressive dynamics) and the transfer of dynamics from relevant predictors
Understand and predict the behavior of dynamic systems from experimental or observational data
Commonly used for modeling and forecasting of economic, financial and biological systems

How to spot the anomaly in a sea of logon data?

“Triple Exponential Smoothing (Holt-Winters method) is one of many algorithms used to forecast data points in a series, provided that the series is “seasonal”, i.e. repetitive over some period.”
Winters improved on Holts double exponential smoothing by adding seasonality in 1960 and published Forecasting sales by exponentially weighted moving averages

Let's imagine our user, DARPA-549521, in the SUPERSECURE domain, with 90 days of aggregate 4624 Type 10 events by day.

Figure 9: User logon data

With 210 line of R, including comments, log read, file output, and graphing we can visualize and alert on DARPA-549521's data as seen in Figure 10.

Figure 10: User behavior outside the confidence interval

We can detect when a user’s account exhibits changes in their seasonality as it relates to a confidence interval established (learned) over time. In this case, on 27 AUG 2017, the user topped her threshold of 19 logons thus triggering an exception. Now imagine using this model to spot anomalous user behavior across all users and you get a good feel for the model's power.
Eric points out that there are, of course, additional options for modeling including:

Seasonal and Trend Decomposition using Loess (STL)

Handles any type of seasonality ~ can change over time
Smoothness of the trend-cycle can also be controlled by the user
Robust to outliers

Classification and Regression Trees (CART)

Supervised learning approach: teach trees to classify anomaly / non-anomaly
Unsupervised learning approach: focus on top-day hold-out and error check

Neural Networks

LSTM / Multiple time series in combination

These are powerful next steps in your capabilities, I want you to be brave, be creative, go forth and add elements of data science and visualization to your practice. R and Python are well supported and broadly used for this mission and can definitely help you detect attackers faster, contain incidents more rapidly, and enhance your in-house detection and remediation mechanisms.
All the code as I can share is here; sorry, I can only share the TSR example without the source.
All the best in your endeavors!
Cheers...until next time.

Tuesday, October 17, 2017

toolsmith #128 - DFIR Redefined: Deeper Functionality for Investigators with R - Part 1

“To competently perform rectifying security service, two critical incident response elements are necessary: information and organization.” ~ Robert E. Davis

I've been presenting DFIR Redefined: Deeper Functionality for Investigators with R across the country at various conference venues and thought it would helpful to provide details for readers.
The basic premise?
Incident responders and investigators need all the help they can get.
Let me lay just a few statistics on you, from Secure360.org's The Challenges of Incident Response, Nov 2016. Per their respondents in a survey of security professionals:

38% reported an increase in the number of hours devoted to incident response
42% reported an increase in the volume of incident response data collected
39% indicated an increase in the volume of security alerts

In short, according to Nathan Burke, “It’s just not mathematically possible for companies to hire a large enough staff to investigate tens of thousands of alerts per month, nor would it make sense.”
The 2017 SANS Incident Response Survey, compiled by Matt Bromiley in June, reminds us that “2016 brought unprecedented events that impacted the cyber security industry, including a myriad of events that raised issues with multiple nation-state attackers, a tumultuous election and numerous government investigations.” Further, "seemingly continuous leaks and data dumps brought new concerns about malware, privacy and government overreach to the surface.”
Finally, the survey shows that IR teams are:

Detecting the attackers faster than before, with a drastic improvement in dwell time
Containing incidents more rapidly
Relying more on in-house detection and remediation mechanisms

To that end, what concepts and methods further enable handlers and investigators as they continue to strive for faster detection and containment? Data science and visualization sure can’t hurt. How can we be more creative to achieve “deeper functionality”? I propose a two-part series on Deeper Functionality for Investigators with R with the following DFIR Redefined scenarios:

Have you been pwned?
Visualization for malicious Windows Event Id sequences
How do your potential attackers feel, or can you identify an attacker via sentiment analysis?
Fast Frugal Trees (decision trees) for prioritizing criticality

R is “100% focused and built for statistical data analysis and visualization” and “makes it remarkably simple to run extensive statistical analysis on your data and then generate informative and appealing visualizations with just a few lines of code.”

With R you can interface with data via file ingestion, database connection, APIs and benefit from a wide range of packages and strong community investment.
From the Win-Vector Blog, per John Mount “not all R users consider themselves to be expert programmers (many are happy calling themselves analysts). R is often used in collaborative projects where there are varying levels of programming expertise.”
I propose that this represents the vast majority of us, we're not expert programmers, data scientists, or statisticians. More likely, we're security analysts re-using code for our own purposes, be it red team or blue team. With a very few lines of R investigators might be more quickly able to reach conclusions.
All the code described in the post can be found on my GitHub.

Have you been pwned?

This scenario I covered in an earlier post, I'll refer you to Toolsmith Release Advisory: Steph Locke's HIBPwned R package.

Visualization for malicious Windows Event Id sequences

Windows Events by Event ID present excellent sequenced visualization opportunities. A hypothetical scenario for this visualization might include multiple failed logon attempts (4625) followed by a successful logon (4624), then various malicious sequences. A fantastic reference paper built on these principle is Intrusion Detection Using Indicators of Compromise Based on Best Practices and Windows Event Logs. An additional opportunity for such sequence visualization includes Windows processes by parent/children. One R library particularly well suited to is TraMineR: Trajectory Miner for R. This package is for mining, describing and visualizing sequences of states or events, and more generally discrete sequence data. It's primary aim is the analysis of biographical longitudinal data in the social sciences, such as data describing careers or family trajectories, and a BUNCH of other categorical sequence data. Somehow though, the project page somehow fails to mention malicious Windows Event ID sequences. :-) Consider Figures 1 and 2 as retrieved from above mentioned paper. Figure 1 are text sequence descriptions, followed by their related Windows Event IDs in Figure 2.

Figure 1

Figure 2

Taking related log data, parsing and counting it for visualization with R would look something like Figure 3.

Figure 3

How much R code does it take to visualize this data with a beautiful, interactive sunburst visualization? Three lines, not counting white space and comments, as seen in the video below.

A screen capture of the resulting sunburst also follows as Figure 4.

Figure 4

How do your potential attackers feel, or can you identify an attacker via sentiment analysis?

Do certain adversaries or adversarial communities use social media? Yes

As such, can social media serve as an early warning system, if not an actual sensor? Yes

Are certain adversaries, at times, so unaware of OpSec on social media that you can actually locate them or correlate against other geo data? Yes

Some excellent R code to assess Twitter data with includes Jeff Gentry's twitteR and rtweet to interface with the Twitter API.

twitteR: provides access to the Twitter API. Most functionality of the API is supported, with a bias towards API calls that are more useful in data analysis as opposed to daily interaction.
Rtweet: R client for interacting with Twitter’s REST and stream API’s.

The code and concepts here are drawn directly from Michael Levy, PhD UC Davis: Playing With Twitter.

Here's the scenario: DDoS attacks from hacktivist or chaos groups.

Attacker groups often use associated hashtags and handles and the minions that want to be "part of" often retweet and use the hashtag(s). Individual attackers either freely give themselves away, or often become easily identifiable or associated, via Twitter. As such, here's a walk-through of analysis techniques that may help identify or better understand the motives of certain adversaries and adversary groups. I don't use actual adversary handles here, for obvious reasons. I instead used a DDoS news cycle and journalist/bloggers handles as exemplars. For this example I followed the trail of the WireX botnet, comprised mainly of Android mobile devices utilized to launch a high-impact DDoS extortion campaign against multiple organizations in the travel and hospitality sector in August 2017. I started with three related hashtags:

#DDOS
#Android
#WireX

We start with all related Tweets by day and time of day. The code is succinct and efficient, as noted in Figure 5.

Figure 5

The result is a pair of graphs color coded by tweets and retweets per Figure 6.

Figure 6

This gives you an immediate feels for spikes in interest by day as well as time of day, particularly with attention to retweets.

Want to see what platforms potential adversaries might be tweeting from? No problem, code in Figure 7.

Figure 7

The result in the scenario ironically indicates that the majority of related tweets using our hashtags of interest are coming from Androids per Figure 8. :-)

Figure 8

Now to the analysis of emotional valence, or the "the intrinsic attractiveness (positive valence) or averseness (negative valence) of an event, object, or situation."
orig$text[which.max(orig$emotionalValence)] tells us that the most positive tweet is "A bunch of Internet tech companies had to work together to clean up #WireX #Android #DDoS #botnet."
orig$text[which.min(orig$emotionalValence)] tells us that "Dangerous #WireX #Android #DDoS #Botnet Killed by #SecurityGiants" is the most negative tweet.
Interesting right? Almost exactly the same message, but very different valence.
How do we measure emotional valence changes over the day? Four lines later...
filter(orig, mday(created) == 29) %>%
ggplot(aes(created, emotionalValence)) +
geom_point() +
geom_smooth(span = .5)

...and we have Figure 9, which tell us that most tweets about WireX were emotionally neutral on 29 AUG 2017, around 0800 we saw one positive tweet, a more negative tweets overall in the morning.

Figure 9

Another line of questioning to consider: which tweets are more often retweeted, positive or negative? As you can imagine with information security focused topics, negativity wins the day.
Three lines of R...
ggplot(orig, aes(x = emotionalValence, y = retweetCount)) +
geom_point(position = 'jitter') +
geom_smooth()
...and we learn just how popular negative tweets are in Figure 10.

Figure 10

There are cluster of emotionally neutral retweets, two positive retweets, and a load of negative retweets. This type of analysis can quickly lead to a good feel for the overall sentiment of an attacker collective, particularly one with less opsec and more desire for attention via social media.
In Part 2 of DFIR Redefined: Deeper Functionality for Investigators with R we'll explore this scenario further via sentiment analysis and Twitter data, as well as Fast Frugal Trees (decision trees) for prioritizing criticality.
Let me know if you have any questions on the first part of this series via @holisticinfosec or russ at holisticinfosec dot org.
Cheers...until next time.

Sunday, May 21, 2017

Toolsmith #125: ZAPR - OWASP ZAP API R Interface

It is my sincere hope that when I say OWASP Zed Attack Proxy (ZAP), you say "Hell, yeah!" rather than "What's that?". This publication has been a longtime supporter, and so many brilliant contibutors and practitioners have lent to OWASP ZAPs growth, in addition to @psiinon's extraordinary project leadership. OWASP ZAP has been 1st or 2nd in the last four years of @ToolsWatch best tool survey's for a damned good reason. OWASP ZAP usage has been well documented and presented over the years, and the wiki gives you tons to consider as you explore OWASP ZAP user scenarios.
One of the more recent scenarios I've sought to explore recently is use of the OWASP ZAP API. The OWASP ZAP API is also well documented, more than enough detail to get you started, but consider a few use case scenarios.
First, there is a functional, clean OWASP ZAP API UI, that gives you a viewer's perspective as you contemplate programmatic opportunities. OWASP ZAP API interaction is URL based, and you can invoke both access views and invoke actions. Explore any component and you'll immediately find related views or actions. Drilling into to core via http://localhost:8067/UI/core/ (I run OWASP ZAP on 8067, your install will likely be different), gives me a ton to choose from.

You'll need your API key in order to build queries. You can find yours via Tools | Options | API | API Key. As an example, drill into numberOfAlerts (baseurl ), which gets the number of alerts, optionally filtering by URL. You'll then be presented with the query builder, where you can enter you key, define specific parameter, and decide your preferred output format including JSON, HTML, and XML.

Sure, you'll receive results in your browser, this query will provide answers in HTML tables, but these aren't necessarily optimal for programmatic data consumption and manipulation. That said, you learn the most important part of this lesson, a fully populated OWASP ZAP API GET URL: http://localhost:8067/HTML/core/view/numberOfAlerts/?zapapiformat=HTML&apikey=2v3tebdgojtcq3503kuoq2lq5g&formMethod=GET&baseurl=.
This request would return

in HTML. Very straightforward and easy to modify per your preferences, but HTML results aren't very machine friendly. Want JSON results instead? Just swap out HTML with JSON in the URL, or just choose JSON in the builder. I'll tell you than I prefer working with JSON when I use the OWASP ZAP API via the likes of R. It's certainly the cleanest, machine-consumable option, though others may argue with me in favor of XML.
Allow me to provide you an example with which you can experiment, one I'll likely continue to develop against as it's kind of cool for active reporting on OWASP ZAP scans in flight or on results when session complete. Note, all my code, crappy as it may be, is available for you on GitHub. I mean to say, this is really v0.1 stuff, so contribute and debug as you see fit. It's also important to note that OWASP ZAP needs to be running, either with an active scanning session, or a stored session you saved earlier. I scanned my website, holisticinfosec.org, and saved the session for regular use as I wrote this. You can even see reference to the saved session by location below.
R users are likely aware of Shiny, a web application framework for R, and its dashboard capabilities. I also discovered that rCharts are designed to work interactively and beautifully within Shiny.
R includes packages that make parsing from JSON rather straightforward, as I learned from Zev Ross. RJSONIO makes it as easy as fromJSON("http://localhost:8067/JSON/core/view/alerts/?zapapiformat=JSON&apikey=2v3tebdgojtcq3503kuoq2lq5g&formMethod=GET&baseurl=&start=&count=")
to pull data from the OWASP ZAP API. We use the fromJSON "function and its methods to read content in JSON format and de-serializes it into R objects", where the ZAP API URL is that content.
I further parsed alert data using Zev's grabInfo function and organized the results into a data frame (ZapDataDF). I then further sorted the alert content from ZapDataDF into objects useful for reporting and visualization. Within each alert objects are values such as the risk level, the alert message, the CWE ID, the WASC ID, and the Plugin ID. Defining each of these values into parameter useful to R is completed with the likes of:

I then combined all those results into another data frame I called reportDF, the results of which are seen in the figure below.

reportDF results

Now we've got some content we can pivot on.
First, let's summarize the findings and present them in their resplendent glory via ZAPR: OWASP ZAP API R Interface.
Code first, truly simple stuff it is:

Summary overview API calls

You can see that we're simply using RJSONIO's fromJSON to make specific ZAP API call. The results are quite tidy, as seen below.

ZAPR Overview

One of my favorite features in Shiny is the renderDataTable function. When utilized in a Shiny dashboard, it makes filtering results a breeze, and thus is utilized as the first real feature in ZAPR. The code is tedious, review or play with it from GitHub, but the results should speak for themselves. I filtered the view by CWE ID 89, which in this case is a bit of a false positive, I have a very flat web site, no database, thank you very much. Nonetheless, good to have an example of what would definitely be a high risk finding.

Alert filtering

Alert filtering is nice, I'll add more results capabilities as I develop this further, but visualizations are important too. This is where rCharts really come to bear in Shiny as they are interactive. I've used the simplest examples, but you'll get the point. First, a few, wee lines of R as seen below.

Chart code

The results are much more satisfying to look at, and allow interactivity. Ramnath Vaidyanathan has done really nice work here. First, OWASP ZAP alerts pulled via the API are counted by risk in a bar chart.

Alert counts

As I moused over Medium, we can see that there were specifically 17 results from my OWASP ZAP scan of holisticinfosec.org.
Our second visualization are the CWE ID results by count, in an oft disdained but interactive pie chart (yes, I have some work to do on layout).

CWE IDs by count

As we learned earlier, I only had one CWE ID 89 hit during the session, and the visualization supports what we saw in the data table.
The possibilities are endless to pull data from the OWASP ZAP API and incorporate the results into any number of applications or report scenarios. I have a feeling there is a rich opportunity here with PowerBI, which I intend to explore. All the code is here, along with the OWASP ZAP session I refer to, so you can play with it for yourself. You'll need OWASP ZAP, R, and RStudio to make it all work together, let me know if you have questions or suggestions.
Cheers, until next time.

Sunday, July 10, 2016

Toolsmith Release Advisory: Steph Locke's HIBPwned R package

I'm a bit slow on this one but better late than never. Steph dropped her HIBPwned R package on CRAN at the beginning of June, and it's well worth your attention. HIBPwned is an R package that wraps Troy Hunt's HaveIBeenPwned.com API, useful to check if you have an account that has been compromised in a data breach. As one who has been "pwned" no less than three times via three different accounts thanks to LinkedIn, Patreon, and Adobe, I love Troy's site and have visited it many times.

When I spotted Steph's wrapper on R-Bloggers, I was quite happy as a result.
Steph built HIBPwned to allow users to:

Set up your own notification system for account breaches of myriad email addresses & user names that you have
Check for compromised company email accounts from within your company Active Directory
Analyse past data breaches and produce reports and visualizations

I installed it from Visual Studio with R Tools via install.packages("HIBPwned", repos="http://cran.rstudio.com/", dependencies=TRUE).
You can also use devtools to install directly from the Censornet Github
if(!require("devtools")) install.packages("devtools")
# Get or upgrade from github
devtools::install_github("censornet/HIBPwned")
Source is available on the Censornet Github, as is recommended usage guidance.
As you run any of the HIBPwned functions, be sure to have called the library first: library("HIBPwned").

As mentioned, I've seen my share of pwnage, luckily to no real impact, but annoying nonetheless, and well worth constant monitoring.
I first combined my accounts into a vector and confirmed what I've already mentioned, popped thrice:
account_breaches(c("rmcree@yahoo.com","holisticinfosec@gmail.com","russ@holisticinfosec.org"), truncate = TRUE)
$`rmcree@yahoo.com`
Name
1 Adobe

$`holisticinfosec@gmail.com`
Name
1 LinkedIn

$`russ@holisticinfosec.org`
Name
1 Patreon

You may want to call specific details about each breach to learn more, easily done continuing with my scenario using breached_site() for the company name or breached_sites() for its domain.

Breached

You may also be interested to see if any of your PII has landed on a paste site (Pastebin, etc.). The pastes() function is the most recent Steph added to HIBPwned.

Pasted

Uh oh, on the list here too, not quite sure how I ended up on this dump of "Egypt gov stuff". According to PK1K3, who "got pissed of at the Egypt gov", his is a "list of account the egypt govs is spying on if you find your email/number here u are rahter with them or slaves to them." Neither are true, but fascinating regardless.

Need some simple markdown to run every so often and keep an eye on your accounts? Try HIBPwned.Rmd. Download the file, open it R Studio, swap out my email addresses for yours, then select Knit HTML. You can also produce Word or PDF output if you'd prefer.

Report

Great stuff from Steph, and of course Troy. Use this wrapper to your advantage, and keep an eye out for other related work on itsalocke.com.

Sunday, May 08, 2016

toolsmith #116: vFeed & vFeed Viewer

Overview

In case you haven't guessed by now, I am an unadulterated tools nerd. Hopefully, ten years of toolsmith have helped you come to that conclusion on your own. I rejoice when I find like-minded souls, I found one in Nabil (NJ) Ouchn (@toolswatch), he of Black Hat Arsenal and toolswatch.org fame. In addition to those valued and well-executed community services, NJ also spends a good deal of time developing and maintaining vFeed. vFeed included a Python API and the vFeed SQLite database, now with support for Mongo. It is, for all intents and purposes a correlated community vulnerability and threat database. I've been using vFeed for quite a while now having learned about it when writing about FruityWifi a couple of years ago.
NJ fed me some great updates on this constantly maturing product.
Having achieved compatibility certifications (CVE, CWE and OVAL) from MITRE, the vFeed Framework (API and Database) has started to gain more than a little gratitude from the information security community and users, CERTs and penetration testers. NJ draws strength from this to add more features now and in the future. The actual vFeed roadmap is huge. It varies from adding new sources such as security advisories from industrial control system (ICS) vendors, to supporting other standards such as STIX, to importing/enriching scan results from 3rd party vulnerability and threat scanners such as Nessus, Qualys, and OpenVAS.
There have a number of articles highlighting impressive vFeed uses cases of vFeed such as:

Affordable Security: Making the most of free tools and data (Splunk & Nessus)
Applying Data Analytics on Vulnerability Data (Enhancing data analytics for vulnerability management, from SANS)
SimaticScan (using vFeed data to create an industrial PLC security scanner)
Correlation and Visualization using vFeed

Needless to say, some fellow security hackers and developers have included vFeed in their toolkit, including Faraday (March 2015 toolsmith), Kali Linux, and more (FruityWifi as mentioned above).

The upcoming version vFeed will introduce support for CPE 2.3, CVSS 3, and new reference sources. A proof of concept to access the vFeed database via a RESTFul API is in testing as well. NJ is fine-tuning his Flask skills before releasing it. :) NJ, does not consider himself a Python programmer and considers himself unskilled (humble but unwarranted). Luckily Python is the ideal programming language for someone like him to express his creativity.
I'll show you all about woeful programming here in a bit when we discuss the vFeed Viewer I've written in R.

First, a bit more about vFeed, from its Github page:
The vFeed Framework is CVE, CWE and OVAL compatible and provides structured, detailed third-party references and technical details for CVE entries via an extensible XML/JSON schema. It also improves the reliability of CVEs by providing a flexible and comprehensive vocabulary for describing the relationship with other standards and security references.
vFeed utilizes XML-based and JSON-based formatted output to describe vulnerabilities in detail. This output can be leveraged as input by security researchers, practitioners and tools as part of their vulnerability analysis practice in a standard syntax easily interpreted by both human and machine.
The associated vFeed.db (The Correlated Vulnerability and Threat Database) is a detective and preventive security information repository useful for gathering vulnerability and mitigation data from scattered internet sources into an unified database.
vFeed's documentation is now well populated in its Github wiki, and should be read in its entirety:

vFeed features include:

Easy integration within security labs and other pentesting frameworks
Easily invoked via API calls from your software, scripts or from command-line. A proof of concept python api_calls.py is provided for this purpose
Simplify the extraction of related CVE attributes
Enable researchers to conduct vulnerability surveys (tracking vulnerability trends regarding a specific CPE)
Help penetration testers analyze CVEs and gather extra metadata to help shape attack vectors to exploit vulnerabilities
Assist security auditors in reporting accurate information about findings during assignments. vFeed is useful in describing a vulnerability with attributes based on standards and third-party references(vendors or companies involved in the standardization effort)

vFeed installation and usage

Installing vFeed is easy, just download the ZIP archive from Github and unpack it in your preferred directory or, assuming you've got Git installed, run git clone https://github.com/toolswatch/vFeed.git.

You'll need a Python interpreter installed, the latest instance of 2.7 is preferred. From the directory in which you installed vFeed, just run python vfeedcli.py -h followed by python vfeedcli.py -u to confirm all is updated and in good working order; you're ready to roll.

You've now read section 2 (Usage) on the wiki, so you don't need a total usage rehash here. We'll instead walk through a few options with one of my favorite CVEs: CVE-2008-0793.

If we invoke python vfeedcli.py -m get_cve CVE-2008-0793, we immediately learn that it refers to a Tendenci CMS cross-site scripting vulnerability. The -m parameter lets you define the preferred method, in this case, get_cve.

Groovy, is there an associated CWE for CVE-2008-0793? But of course. Using the get_cwe method we learn that CWE-79 or "Improper Neutralization of Input During Web Page Generation ('Cross-site Scripting')" is our match.

If you want to quickly learn all the available methods, just run python vfeedcli.py --list.

Perhaps you'd like to determine what the CVSS score is, or what references are available, via the vFeed API? Easy, if you run...

from lib.core.methods.risk import CveRisk

cve = "CVE-2014-0160"

cvss = CveRisk(cve).get_cvss()

print cvss

You'll retrieve...

For reference material...

from lib.core.methods.ref import CveRef

cve = "CVE-2008-0793"

ref = CveRef(cve).get_refs()

print ref

Yields...

And now you know...the rest of the story. CVE-2008-0793 is one of my favorites because a) I discovered it, and b) the vendor was one of the best of many hundreds I've worked with to fix vulnerabilities.

vFeed Viewer

If NJ thinks his Python skills are rough, wait until he sees this. :-)

I thought I'd get started on a user interface for vFeed using R and Shiny, appropriately name vFeed Viewer and found on Github here. This first version does not allow direct queries of the vFeed database as I'm working on SQL injection prevention, but it does allow very granular filtering of key vFeed tables. Once I work out safe queries and sanitization, I'll build the same full correlation features you enjoy from NJ's Python vFeed client.
You'll need a bit of familiarity with R to make use of this viewer.
First install R, and RStudio. From the RStudio console, to ensure all dependencies are met, run install.packages(c("shinydashboard","RSQLite","ggplot2","reshape2")).
Download and install the vFeed Viewer in the root vFeed directory such that app.R and the www directory are side by side with vfeedcli.py, etc. This ensures that it can read vfeed.db as the viewer calls it directly with dbConnect and dbReadTable, part of the RSQLite package.
Open app.R with RStudio then, click the Run App button. Alternatively, from the command-line, assuming R is in your path, you can run R -e "shiny::runApp('~/shinyapp')" where ~/shinyapp is the path to where app.R resides. In my case, on Windows, I ran R -e "shiny::runApp('c:\\tools\\vfeed\\app.R')". Then browser to the localhost address Shiny is listening on. You'll probably find the RStudio process easier and faster.
One important note about R, it's not known for performance, and this app takes about thirty seconds to load. If you use Microsoft (Revolution) R with the MKL library, you can take advantage of multiple cores, but be patient, it all runs in memory. Its fast as heck once it's loaded though.
The UI is simple, including an overview.

At present, I've incorporated NVD and CWE search mechanisms that allow very granular filtering.

As an example, using our favorite CVE-2008-0793, we can zoom in via the search field or the CVE ID drop down menu. Results are returned instantly from 76,123 total NVD entries at present.

From the CWE search you can opt to filter by keywords, such as XSS for this scenario, to find related entries. If you drop cross-site scripting in the search field, you can then filter further via the cwetitle filter field at the bottom of the UI. This is universal to this use of Shiny, and allows really granular results.

You can also get an idea of the number of vFeed entries per vulnerability category entities. I did drop CPEs as their number throws the chart off terribly and results in a bad visualization.

I'll keep working on the vFeed Viewer so it becomes more useful and helps serve the vFeed community. It's definitely a work in progress but I feel as if there's some potential here.

Conslusion

Thanks to NJ for vFeed and all his work with the infosec tools community, if you're going to Black Hat be certain to stop by Arsenal. Make use of vFeed as part of your vulnerability management practice and remember to check for updates regularly. It's a great tool, and getting better all the time.
Ping me via email or Twitter if you have questions (russ at holisticinfosec dot org or @holisticinfosec).
Cheers…until next month.

Acknowledgements

Nabil (NJ) Ouchn (@toolswatch)