This is the second, and likely final, post I will do on using Circos for an IT investigation. In the first part of this detailed overview, I covered the gathering of evidence and how that evidence was used to generate the initial Circos visualization. In this post, we will add, step by step, the extra details that resulted in my final graph.


The first thing we need to do is add the banding to each of the “Users” arcs. The bands in Circos are defined using the configuration line:-


BAND [Parent CHROMOSOME] BAND-NAME BAND-NAME START-POS END-POS COLOR


Unlike other types of potential visualizations, my requirement is that a single band represents an email. So I needed to add a band configuration that was defined by 1 unit. This produced a band configuration as shown below which was appended to my karyotype file.
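As a rough illustration of the one-unit-per-email idea, band lines in the format above could be generated with a few lines of Python. This is only a sketch: the band naming scheme and the alternating color names ("grey", "white") are assumptions for the example, not values taken from my karyotype file.

```python
def band_lines(chrom_id, n_emails, colors=("grey", "white")):
    """Return one Circos band line per email, each exactly 1 unit wide."""
    lines = []
    for i in range(n_emails):
        name = f"{chrom_id}_m{i + 1}"    # hypothetical band name, reused as the label
        color = colors[i % len(colors)]  # alternate colors so adjacent bands stay visible
        lines.append(f"band {chrom_id} {name} {name} {i} {i + 1} {color}")
    return lines


# Example: three emails on one arc, one email on another.
print("\n".join(band_lines("user1", 3) + band_lines("user2", 1)))
```

The output lines would then be appended to the karyotype file after the chromosome definitions.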



If we now run the Circos command again, as shown in part one, with the new karyotype file, we end up with a visualization that includes bands on each of the Users' arcs.



We are now ready to start defining the links that show which user sent which email and to whom. To draw a line linking two User segments together, you need to specify the co-ordinates of each end point under a common Link ID as per the following:-


LINKID [Parent 1 CHROMOSOME] StartCo-Ord EndCo-Ord

LINKID [Parent 2 CHROMOSOME] StartCo-Ord EndCo-Ord



When I was creating my link file, I went back to my original Excel spreadsheet and created a dedicated “link” worksheet to track all the connections. I sorted my Message summary worksheet (the first worksheet I used to record all the details of each message) by MessageID, and then by Sender. From there, I worked my way down the rows by MessageID, adding the appropriate details into my “link” worksheet.


Making use of the fantastic flexibility of Circos, I then created separate link files for each “Sender”. This allowed me to use the link “Rules” option to define different link line formats (e.g. color), making it easier to see who sent an email and who received it.
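The spreadsheet work described above could also be scripted. The sketch below is an assumption-laden illustration rather than the process I actually followed: it assumes each (user, MessageID) pair occupies the next free one-unit band on that user's arc, and the link IDs and file names are hypothetical.

```python
from collections import defaultdict


def build_links(rows):
    """rows: iterable of (message_id, sender, recipient) tuples.

    Returns {sender: [link file lines]}, two lines per link, grouped by
    sender so that each sender can be written to its own link file.
    """
    slots = {}                    # (user, message_id) -> band start position
    next_free = defaultdict(int)  # user -> next unused 1-unit band

    def slot(user, message_id):
        key = (user, message_id)
        if key not in slots:
            slots[key] = next_free[user]
            next_free[user] += 1
        return slots[key]

    per_sender = defaultdict(list)
    # Sorting mirrors the "by MessageID, then by Sender" ordering of the worksheet.
    for n, (message_id, sender, recipient) in enumerate(sorted(rows), start=1):
        link_id = f"link{n}"
        s, r = slot(sender, message_id), slot(recipient, message_id)
        per_sender[sender].append(f"{link_id} {sender} {s} {s + 1}")
        per_sender[sender].append(f"{link_id} {recipient} {r} {r + 1}")
    return dict(per_sender)


rows = [(1, "external", "user1"), (1, "user1", "user2"), (2, "external", "user1")]
for sender, lines in build_links(rows).items():
    print(f"# links_{sender}.txt")
    print("\n".join(lines))
```

Keeping one list per sender is what makes it possible to point a separate link block, with its own color rule, at each file.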


The final part of the visualization was to add the outside histogram. In my previous post I mentioned that I also recorded the Date/Time in the summary Excel worksheet for each MessageID. I created a new column in my Excel spreadsheet and used this formula to determine the age in months since the email had been sent:-


=(YEAR(DATE(2009,10,1))-YEAR(H52))*12 + MONTH(DATE(2009,10,1)) - MONTH(H52)


Note: column H was where the Date time was recorded in the summary sheet.
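For anyone who prefers a script to a spreadsheet, the same calculation can be expressed in Python. This is a minimal sketch that mirrors the Excel formula above; the function name is made up for the example, and the default reference date of 1 October 2009 is simply the one used in the formula.

```python
from datetime import date


def age_in_months(sent, as_of=date(2009, 10, 1)):
    """Whole calendar months between `sent` and `as_of`, ignoring the day of month."""
    return (as_of.year - sent.year) * 12 + (as_of.month - sent.month)


print(age_in_months(date(2009, 7, 15)))  # same year, three calendar months earlier
print(age_in_months(date(2008, 4, 30)))  # spans a year boundary
```

Like the Excel version, this only compares year and month, so an email sent on the last day of September still counts as one month old on 1 October.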


I then created an additional column to divide the total number of months by 3 to define the age in quarters. I did it this way instead of modifying the original formula because I was playing around with the scale on the histogram to determine the best visual representation while still conveying meaning. To plot the histogram I had to define a plot file. I grouped both the sender and recipient together to ensure the histogram values were correct on either side of the link. The plot elements were defined in the file using the syntax:-


[Parent 1 CHROMOSOME] StartCo-Ord EndCo-Ord HISTOGRAM-Value

[Parent 2 CHROMOSOME] StartCo-Ord EndCo-Ord HISTOGRAM-Value
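A sketch of how plot lines in this format could be produced from the link data, assuming the age in months is already known for each link and that the histogram value is the age in quarters (months divided by 3). The tuple layout and function name are hypothetical.

```python
def histogram_lines(links):
    """links: iterable of (sender, s_start, recipient, r_start, age_months).

    Emits one plot line for each end of the link so that the histogram
    shows the same value on both the sender and recipient arcs.
    """
    lines = []
    for sender, s_start, recipient, r_start, age_months in links:
        value = age_months / 3  # age in quarters
        lines.append(f"{sender} {s_start} {s_start + 1} {value:g}")
        lines.append(f"{recipient} {r_start} {r_start + 1} {value:g}")
    return lines


print("\n".join(histogram_lines([("external", 0, "user1", 0, 6)])))
```

Keeping the division by 3 as a separate step makes it easy to experiment with a different scale, which is the same reason I kept it in its own spreadsheet column.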



Once again, while defining the histogram “plot” section in my circos.conf file, I made use of Circos’s rules, this time using 3 rules based on the histogram value to change the fill color. This allows people to quickly distinguish the more recent incidents from older ones. This helps because someone who received a lot of inappropriate material, but only more than 12 months ago, can be a different issue for HR than someone who has received a lot in the last 3 months.
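The logic behind the three rules can be sketched in Python. The thresholds follow the color scheme described in my first Circos post (red for the most recent 3 months, orange for 3 to 9 months, green for older); how the boundary values are treated is an assumption here, and the function takes the age in months, so the thresholds would need dividing by 3 if the plot values are in quarters.

```python
def fill_color(age_months):
    """Map an email's age in months to a histogram fill color."""
    if age_months < 3:
        return "red"      # recent: within the last 3 months
    if age_months < 9:
        return "orange"   # 3 to 9 months old
    return "green"        # 9 months or older


for months in (1, 6, 12):
    print(months, fill_color(months))
```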


As promised, I am posting a detailed overview of the steps I undertook to create my email investigation visualization using Circos that I wrote about here.


To help set the scene, the evidence used to generate the data files for the Circos configuration originated from a number of Exmerged (exported) Exchange mailboxes. These mailboxes were loaded into a copy of AccessData’s FTK v1.8 (sorry, but v2 really was bad for high-volume thumbnail graphic analysis and v3 wasn’t released when I started the investigation). Once the PSTs had been processed, I could use the Graphics tab to review the generated thumbnails of all images found, quickly identify inappropriate content, and assign the attachments and the parent email to a bookmark. Once I had finished with the images, I went back to the Overview tab and reviewed all the movie attachments, again bookmarking any identified material and its parent email. The benefit of the bookmark feature was that after I had analyzed all the Exmerged mailboxes and other content, I could access all the identified evidence in one place.


To prepare the Circos data files I made use of Microsoft Excel to store key information. Why Excel? Because it was handy and it's easy to use if you need to sort data in different ways while you are trying to work out what you require. The key worksheet I used was a summary table of all the emails that I had marked as evidence. For each person that sent or was a recipient of an email, a row was created. This meant that for most emails there were multiple rows of To/From pairs. To create the Circos configuration files the key columns of the summary table in Excel were:-


MessageID

This allowed me to track a common email that went to numerous people. If the body of the email showed that the content had previously been forwarded to internal people, then additional rows were generated for those details, using the same MessageID. This let me easily track the email flow between people.

Sender

Email address of the Sender of the email, or “External” as a group name if it was sent from outside the company to internal employees.

Recipient

Email address of the Recipient of the email, or “External” as a group name if it was sent to someone on the Internet.

Date-Time

This was used for generating the histogram around the outside


I also recorded other information that, while not useful for generating the visualization, helped to produce summary tables and raw evidence exports for HR. These columns were:-


Content Type

This was broken into PowerPoint, image or movie.

Src Mailbox

No matter how careful people are at cleaning their own Mailboxes, it only takes one person in the network not to delete emails for everyone involved to be implicated. This was used to identify which Exmerged mailbox the evidence was found in.

Subject

Email Subject line


Once I had all this data, I created a new worksheet called “karyotype”. I think I was lucky that I had a very clear idea of what my final visualization was going to look like, so I could attack the problem of generating the graph in steps. The first step was to generate a graph with all the Users' arcs, each with the right number of segments for the total number of emails sent and/or received. To determine all the “Users”, I sorted the first worksheet by Sender+MessageID and recorded the number of different emails sent against each User in the new “karyotype” worksheet. Then I did the same for Recipient+MessageID. Once I had a row for each individual User (plus one for “External” users), I created a third column for the total number of emails.


LABEL       # Sent   # Recv   # Total
User1       5        18       23
User2       1        0        1
User3       0        1        1
External    27       4        31
User4       0        1        1
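The counting step could also be done in code instead of by sorting worksheets. The following is a minimal sketch, assuming the summary rows are available as (MessageID, Sender, Recipient) tuples and that "different emails" means distinct MessageIDs; the function name and sample rows are made up for illustration.

```python
from collections import defaultdict


def email_counts(rows):
    """rows: iterable of (message_id, sender, recipient).

    Returns {user: (sent, received, total)} counting distinct MessageIDs.
    """
    sent, received = defaultdict(set), defaultdict(set)
    for message_id, sender, recipient in rows:
        sent[sender].add(message_id)
        received[recipient].add(message_id)
    users = sorted(set(sent) | set(received))
    return {
        u: (len(sent[u]), len(received[u]), len(sent[u]) + len(received[u]))
        for u in users
    }


rows = [(1, "External", "User1"), (1, "User1", "User2"), (2, "External", "User1")]
for user, (n_sent, n_recv, n_total) in email_counts(rows).items():
    print(user, n_sent, n_recv, n_total)
```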


This became the basis for creating my “karyotype” file. If you review the tutorials that Martin Krzywinski has created for Circos, you will note that the “karyotype” configuration file is defined as:-


chr - ID LABEL START END COLOR


Therefore my karyotype definition became:
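A minimal Python sketch of how lines in this format could be generated from the totals table. It is an illustration, not a copy of my karyotype file: the lower-cased IDs and the color names in the palette are assumptions, and any color used would need to exist in colors.conf.

```python
def karyotype_lines(totals, palette=("red", "green", "blue", "orange", "purple")):
    """totals: list of (label, total_emails) in the order the arcs should be drawn."""
    lines = []
    for i, (label, total) in enumerate(totals):
        color = palette[i % len(palette)]  # hypothetical color choice
        # Each arc runs from 0 to the total number of emails sent and received.
        lines.append(f"chr - {label.lower()} {label} 0 {total} {color}")
    return lines


totals = [("User1", 23), ("User2", 1), ("User3", 1), ("External", 31), ("User4", 1)]
print("\n".join(karyotype_lines(totals)))
```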



When you download the Circos tarball, there is a subdirectory called “tutorials/” that contains the configuration files for the tutorials. When making my graphs, I copied the config files from one of the tutorial directories into a working subdirectory under “tutorials/” called “Investigation”. The benefit of doing this was that, when creating the tutorials, Martin Krzywinski had already defined a number of common configuration files, such as the color definitions. It is in this colors.conf that the RGB values for the color labels shown in my karyotype file above are defined.

At this point I tried to generate my first graph. First I located the circos.conf, ticks.conf and ideogram.conf files in one of the tutorials and copied them into my working directory. Using the circos command and the configuration files that I created/copied into the tutorial subdirectory, I ran the program in the following way.



Which produced the following visualization.



Once I got to this point, it was just a matter of following the tutorials on banding, linking and using histograms to generate the final visualization. I will walk through each of those steps in my next posting.

In my last blog I introduced the genome visualization tool called Circos, created by Martin Krzywinski. In this post I am going to try to provide an overview of the Circos tool in such a way that you can safely concentrate on what the genome terminology represents in the configuration files without being concerned about the specific meaning.

On the Circos webpage you will find excellent tutorials that Martin has already created, and I have no intention of trying to reinvent the wheel. Instead, I hope to provide you with a type of Rosetta stone that you can reference when reading the tutorials so that you can more easily translate your requirements into the specific configuration changes you need.

At the core of Circos is the karyotype file. This file includes the total data set that you are basing your visualization on. For genetics, the file normally contains all data on the chromosomes; for my email investigation visualization, the karyotype file held the complete data for the 27 users being represented. An alternative data set for IT-related visualizations may be the results of a traffic capture on a Class C network. In that case, the karyotype would hold each individual IP address and corresponding traffic type (e.g. Web, mail, P2P, FTP …). Do not be worried about having to filter and include only the data that you may want to use while you are still designing the visualization. Circos gives you plenty of flexibility in its configuration files to draw all the data, or only part of the data represented in the karyotype file.


The karyotype configuration file holds the chromosome data;
– Chromosomes would be equivalent to “email Users” in my investigation visualization
– Chromosomes would be equivalent to “IP addresses” in my network traffic example.



The next central term to understand is the ideogram. For Circos, an ideogram is the graphical representation of a chromosome, and potentially its sub-parts (bands). For my inappropriate content investigation graph each “User” was represented by an ideogram. Each user's ideogram was a different color, and was broken into segments/bands that represented individual emails of interest. In relation to the network traffic example, an ideogram for an IP address (the network traffic's chromosome) may be represented with a different color for different UDP/TCP ports, or could be shown as all 65535 ports with a single line for active ports.


An ideogram is a graphical representation for a chromosome;
– The ideograms in my investigation visualization were colored differently and had bands for individual emails
– In a network traffic visualization the ideogram for an IP address may only represent active ports, or may show all ports with a line showing those with active traffic.



These are the key concepts you need to understand to get started and work through the tutorials that Martin has already provided on the Circos homepage. In my next posting I will explain the configurations I used to generate the image I presented in my first blog on Circos.

Some time ago I wrote a blog explaining the visualization techniques I had developed to help non-technical HR personnel interpret the overall scope of a particular investigation. While the specific evidence was perfectly fine to determine if there was a breach of policy, the depth of complicity of the end user's actions can sometimes be hard to determine with just assorted evidence. An end user that only sends inappropriate content to one person a number of times may be considered different to an end user that forwards one specific piece of inappropriate content to multiple people.

When the latest investigation of this nature appeared on my radar, I had a quick browse to see if there was a better way to automatically generate linking diagrams similar to those I had previously created manually in Visio. While looking at secviz.org I noticed a post by Ben from the Honeynet project in Australia that used the graphics tool Circos.

The Circos tool was created by Martin Krzywinski for visualizing links in genomes. While Circos seemed to be very flexible in the amount of information that could be visualized, it is very industry specific and the configuration terminology is specific to genetics. While it took a bit of time to reverse engineer the terminology in my head and really start to understand how Circos works, I believe this tool could be of great value in the IT space for all sorts of visualizations of large data sets where you want to show relationships. Because of this, I am planning to follow up this blog with a more detailed explanation of the Circos configuration files I produced in the hope that it helps others make use of this tool.

For my first attempt at using Circos I ended up mapping 26 different internal users, and grouping all external users as another entity. This produced the following graphic.

While at first glance, the graphic looks impressive and seems to be very complicated, once you understand how to read it, it quickly becomes very useful for showing the overall relationships between who was sending and receiving inappropriate content, and how many “networks” of people were involved.

If we look at the inner band first, you will note that the circumference is broken up into 27 different colored parts. Each part for this visualization represents a different internal end user; or in the case of the light blue where “User 4” would be, all external users to the company. Each colored arc is then broken into smaller segments. Each of these segments represents a specific email (or emails of similar content if the end user also forwarded it on after receipt) that was sent or received. The lines that link different users are colored the same as the arc that represents the end user that sent the email.

circos-image - arc and links section.png

One of the benefits of the Circos tool is that you can add multiple bands of data in the visualization. Using this ability, I added around the outside of my final graphic a histogram that also shows the age of the email for each segment. As explained in my original post on this topic, adding a time period can be important for HR to determine the appropriate discipline. In my graphic, the histogram for the Y axis is broken into 6 monthly segments. For each email represented, a bar is drawn to show how old (from the date of analysis) the original email in question was when it was received. To make it easier to see trends in time I also used a red bar to represent any emails within 3 months, orange for 3-9 months and green for any emails from 9-18 months old.

circos-image- historgram arc.png

Circos is a wonderful tool, and I definitely plan to expand my use of it in the future. One of the next projects I want to use it for is to visualize internal WAN traffic (probably netflow data) to better understand the internal traffic inter-relationships.

I have recently been working my way through the SANS (http://www.sans.org) course on Reverse Engineering Malware, which has been an extremely enjoyable experience. Anyway, while reading the sections on advanced JavaScript obfuscation, which explain how malware authors use the capabilities of JavaScript's arguments.callee to make analysis and debugging a lot harder, it struck me that there may be an opportunity to turn this capability on its head and use it to protect against such malicious JavaScript.

If a JavaScript malware author uses arguments.callee either to check for script modification (by a change in script length or by a checksum), or in more advanced methods such as the source key for decoding the main JavaScript, then why can’t we implement protective measures that add random blank lines and random-length “canary comments” before any JavaScript is processed by the end browser?
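To make the idea concrete, here is a deliberately naive Python sketch of the padding step only. It is an untested concept rather than a working defence: the function name and comment marker are made up, inserting lines blindly would alter multi-line strings and template literals, and whether the padding changes what a script sees through arguments.callee depends on where it lands relative to the function body.

```python
import random


def add_canaries(js_source, rng=None, max_blank=3, max_comment_len=40):
    """Insert random blank lines and a random-length // comment before each source line."""
    rng = rng or random.Random()
    out = []
    for line in js_source.split("\n"):
        out.extend([""] * rng.randint(0, max_blank))  # random run of blank lines
        out.append("// canary " + "x" * rng.randint(1, max_comment_len))
        out.append(line)                              # original line, unchanged
    return "\n".join(out)


src = "function f() {\n  return 1;\n}"
print(add_canaries(src, random.Random(0)))
```

Because the padding is random on every request, a length check or checksum computed by the malware author in advance would no longer match the text the browser actually receives.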

This might be able to be implemented as a browser plug-in (e.g. NoScript capability??), in Anti-Virus/HIPS agents or even on a proxy server that does content scanning at the perimeter.

Will widespread use of adding such “canary values” to pre-processed JavaScript eliminate the threat from malicious code? No. But if it reduces the effectiveness of a technique that is currently used mainly to make decoding and analysing malicious JavaScript difficult, then it has to be a positive. Plus, in the short term, it would give some people better protection against malicious JavaScript that depends on arguments.callee protections, and make reverse engineering malicious JavaScript a bit simpler.

In my last post, I used the regtime.pl and mactime tools to help determine the potential time a malware infection occurred. In this post, which is very similar to the previous post, I will follow the same steps, however this time I will use the Sleuthkit tools and mactime to analyse the file system changes to determine potential infection time. Normally, you would start with either the registry or the file system mactime, and then move to the alternative based on your findings. However, I thought it would be beneficial to show how the timeline generation and analysis is the same no matter which you start with.

This time, using the SANS Forensics SIFT Workstation VM image, I will use the SleuthKit’s fls and ils commands to produce file system information that can be used by the mactime utility to produce a timeline.

After starting the SIFT workstation I mounted the suspect hard drive to a read-only mount point.

Using the command format of “fls -r -m C: <filepath> > /tmp/fls.log”, the file system on the suspect drive was processed to retrieve information on allocated and unallocated files in the file system.

The options used were:-

-m

mactime output (the “C:” that follows is prepended to each file name as the mount point)

-r

recursive

<filepath>

/dev/sdc1   – This is the device file for the partition being analysed

Using the command format of “ils -m <filepath> > /tmp/ils.log”, the file system on the suspect drive was processed to retrieve any unallocated inodes on the partition being analysed.

The options used were:-

-m

mactime output

<filepath>

/dev/sdc1   – This is the device file for the partition being analysed

Once I had the separate output files I then ran “cat” to join them all together. This was done by using the command:

cat <filename> >> /tmp/mactime-body

Then I ran the Sleuthkit mactime program across the mactime body file in 3 ways.

    1. mactime -b /tmp/mactime-body > /tmp/mactime-body.log
    2. mactime -b /tmp/mactime-body -d -m > /tmp/mactime-body.csv
    3. mactime -b /tmp/mactime-body -d -m 2009-01-01 > /tmp/mactime-body2009.csv
-b the body file to read (in mactime body format)
-d create a comma delimited file
-m use numeric months and not named (i.e. 01 not Jan)
2009-01-01 print only timestamps after this date

The first one will give a full dump in standard Sleuthkit mactime default output. The second one will output a full mactime file in a comma delimited format where each line has its own timestamp. The last one is the same as the second except I am only outputting any information that changed after the 1st Jan 2009.

From there I copied the processed mactime files from the SIFT virtual workstation onto a machine with Excel 2007 on it. You really want to be using Office 2007 to get around the row limit in previous versions of Excel. The benefit of using Excel is that it can be quick and easy to sort, search and filter information that may be of interest in the mactime output files. For example, after loading the mactime-body2009.csv file I can do a find all on .exe files that were modified (including created and deleted) in the C:\Windows\system32 directory. The main reason any .exe should be modified here is if a Microsoft patch is installed. However, since this directory is normally included in the execution search path, malware likes to be dropped in here to avoid execution issues.
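The same search can be scripted if Excel is not available. The sketch below rests on a few assumptions that should be checked against real output: that the comma delimited file starts with a header row containing "Date", "Type" and "File Name" columns, and that deleted entries carry a " (deleted)" or " (deleted-realloc)" suffix on the file name. The sample rows are invented and their date strings are placeholders, not real mactime output.

```python
import csv
import io

# Assumed suffixes on file names of deleted entries.
SUFFIXES = (" (deleted)", " (deleted-realloc)")


def system32_exe_rows(lines):
    """Return (date, type, file name) for .exe entries under windows/system32."""
    hits = []
    for row in csv.DictReader(lines):
        name = row["File Name"].replace("\\", "/").lower()
        for suffix in SUFFIXES:
            if name.endswith(suffix):
                name = name[: -len(suffix)]
        if "/windows/system32/" in name and name.endswith(".exe"):
            hits.append((row["Date"], row["Type"], row["File Name"]))
    return hits


# Invented sample data; a real run would pass open("/tmp/mactime-body2009.csv").
sample = io.StringIO(
    "Date,Size,Type,Mode,UID,GID,Meta,File Name\n"
    "placeholder-date-1,4096,m...,r/rrwxrwxrwx,0,0,1234,C:/WINDOWS/system32/example.exe\n"
    "placeholder-date-2,512,.a..,r/rrwxrwxrwx,0,0,1300,C:/Documents and Settings/user/notes.txt\n"
)
for hit in system32_exe_rows(sample):
    print(hit)
```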

image

Attached is a copy of the output of the find all command in Excel 2007. When reviewing the timeline we can locate the same time period that was determined in the previous blog as a point in time of interest.

The following is an overview of how I used the SANS Forensics SIFT Workstation VM image to investigate a laptop that was infected with malware. The goal of the investigation was to determine if possible how the machine got infected, and when it was infected. To this end I used the regtime.pl utility that is supplied with the image.

The regtime.pl utility will process the timestamps in each key of a registry HIVE and produce output that is compliant to the SleuthKit’s mactime format.

After starting the SIFT workstation I mounted the suspect hard drive to a read-only mount point. The regtime.pl utility can be found in the “/usr/local/src/windows-perl/” directory.

cd /usr/local/src/windows-perl/

Using the command format of “perl regtime.pl -m <HIVENAME> -r <filepath> > /tmp/regtime-<HIVENAME>”, each HIVE on the suspect drive was processed with regtime.pl.

The options used were:-

-m

mactime output

<HIVENAME>

HKLM/SAM
HKLM/SECURITY
HKLM/Software
HKLM/SYSTEM
HKLU

<filepath>

/<read-only mount path>/windows/system32/config/SECURITY

/<read-only mount path>/windows/system32/config/system

/<read-only mount path>/windows/system32/config/software

/<read-only mount path>/Documents and Settings/<username>/NTUSER.DAT

Once I had the separate output files I then ran “cat” to join them all together. This was done by using the command:

cat <filename> >> /tmp/regtime-body

Then I ran the Sleuthkit mactime program across the mactime body file in 3 ways.

    1. mactime -b /tmp/regtime-body > /tmp/regtime-mactime
    2. mactime -b /tmp/regtime-body -d -m > /tmp/regtime-mactime.csv
    3. mactime -b /tmp/regtime-body -d -m 2009-01-01 > /tmp/regtime-mactime2009.csv
-b the body file to read (in mactime body format)
-d create a comma delimited file
-m use numeric months and not named (i.e. 01 not Jan)
2009-01-01 print only timestamps after this date

The first one will give a full dump in standard Sleuthkit mactime default output. The second one will output a full mactime file in a comma delimited format where each line has its own timestamp. The last one is the same as the second except I am only outputting any information that changed after the 1st Jan 2009.

From there I copied the processed mactime files from the SIFT virtual workstation onto a machine with Excel 2007 on it. You really want to be using Office 2007 to get around the row limit in previous versions of Excel. The benefit of using Excel is that it can be quick and easy to sort, search and filter information that may be of interest in the mactime output files. For example, after loading the regtime-mactime2009.csv file I can do a find all on common registry keys that malware plays with.

image

Attached is a copy of the first item found in the registry and the items around it. When reviewing the timeline there is a 3 minute interval where a number of registry keys are modified, however there is no other activity for hours on either side of this activity. This would be a good indication that we may want to look at exactly what changes were made in the registry (if they still exist) to see if this was malware activity.
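Spotting that kind of isolated burst by eye works for a small timeline, but it could also be automated. The following sketch groups timestamps that fall within a few minutes of each other and keeps only the groups with quiet periods on both sides. The thresholds (3 minutes between events, 2 hours of silence, at least 3 events) are assumptions chosen to match the description above, not values from any tool.

```python
from datetime import datetime, timedelta


def isolated_bursts(timestamps, burst_gap=timedelta(minutes=3),
                    quiet=timedelta(hours=2), min_events=3):
    """Return (start, end, count) for bursts of activity surrounded by quiet periods."""
    groups = []
    for t in sorted(timestamps):
        # Extend the current group while consecutive events are close together.
        if groups and t - groups[-1][-1] <= burst_gap:
            groups[-1].append(t)
        else:
            groups.append([t])
    hits = []
    for i, g in enumerate(groups):
        before_ok = i == 0 or g[0] - groups[i - 1][-1] >= quiet
        after_ok = i == len(groups) - 1 or groups[i + 1][0] - g[-1] >= quiet
        if len(g) >= min_events and before_ok and after_ok:
            hits.append((g[0], g[-1], len(g)))
    return hits


base = datetime(2009, 3, 2, 10, 0)  # invented timestamps for illustration
events = [base - timedelta(hours=5), base, base + timedelta(minutes=1),
          base + timedelta(minutes=2), base + timedelta(hours=6)]
for start, end, count in isolated_bursts(events):
    print(start, end, count)
```

Any burst flagged this way is only a pointer to where to look; the registry changes themselves still need to be reviewed.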

In the next blog post, I will use the same process using the Sleuthkit tools.