Visualising Data with Pandas

15 June 2020

As part of my quest to improve my Python skills I've been learning how to visualise data using Pandas. When we parse large data sets out of logs or other artefacts recovered from forensic data, it can be difficult to glean real-world meaning from tables and lists of numbers. Visualising this data is extremely useful for understanding the story it tells, for spotting patterns or interesting anomalies more easily, and for answering the crucial questions: who, what, when, where and why?

About Pandas

Pandas is a Python Data Analysis Library: https://pandas.pydata.org/about/index.html
As a complete Pandas beginner I did the following 2 Pluralsight courses as an introduction to this:
These courses use Jupyter Notebooks and Spyder which are both included when you download Anaconda. Included in the course materials are pre-prepared Jupyter notebook files so you can follow along with the instructor's examples.

I thought Spyder was particularly beginner-friendly because it made the "data frames" (the data structures used by Pandas) so much easier to understand by displaying them graphically in the "variable explorer" along with any other variables you are using. In this case my data frame ("df") was imported from a CSV of data about artworks provided by Pluralsight.


Fig 1

Double clicking on a data frame opens it up:
Fig 2

This makes for easy viewing, so you don't have to memorise or imagine what your data structures look like, and it also has the benefit of being directly convertible/editable:


Fig 3, Fig 4

iOS Battery Levels

For my first attempt at implementing this for something forensic-y I wanted a dataset that was more than just plotting one value against another. I settled on using some iOS battery level data from the iOS 13 sample data posted by the Binary Hick. There would be the battery level vs timestamp data but I could add a further element to indicate whether the phone was plugged in and charging or unplugged and discharging. Hopefully these two would combine to show that the battery goes up when the phone is charging and goes down when it's not!

This is my code:


Fig 5

And this is the plot it produced. We can see the cyclical nature of the charging: the phone gains battery when plugged in and loses it when unplugged (OK, this one is obvious, but not all correlations will be so intuitive):

Fig 6

There is some room for improvement aesthetically but it achieves my initial aim!
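For readers who want to try something similar, here is a minimal sketch of this kind of plot. It is not the code shown in Fig 5. The column names `timestamp`, `battery_level` and `is_charging` are assumptions for illustration; the real export will use whatever headings the parsing tool produces.

```python
import pandas as pd


def prepare_battery_frame(df):
    """Return a copy indexed by parsed timestamp, in chronological order.

    Assumes hypothetical columns: 'timestamp', 'battery_level', 'is_charging'.
    """
    out = df.copy()
    out["timestamp"] = pd.to_datetime(out["timestamp"])
    return out.set_index("timestamp").sort_index()


def plot_battery(df):
    """Plot battery level as a line, with charging state as a step line
    on a secondary y-axis (1 = plugged in, 0 = unplugged)."""
    ax = df["battery_level"].plot(
        figsize=(12, 4), label="Battery level (%)", legend=True
    )
    df["is_charging"].astype(int).plot(
        ax=ax,
        secondary_y=True,
        drawstyle="steps-post",
        label="Charging (1 = plugged in)",
        legend=True,
    )
    return ax
```

Something like `plot_battery(prepare_battery_frame(pd.read_csv("battery.csv")))` would draw it, where `battery.csv` is a placeholder file name. Plotting through Pandas needs matplotlib installed.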

SRUM WiFi usage

SRUM is a great example of where visualising patterns in data can be very valuable to a forensic investigation. "System Resource Usage Monitor" data, stored in SRUDB.dat, tracks network usage by programs on a Windows machine (this presentation is a good introduction). I grabbed my own SRUM data using Nirsoft's NetworkUsageView and made a few plots.

It's not very easy to see the story this data is telling us from looking at it in a basic table form. This file shows just the timestamps and bytes sent/received for the past couple of months of my usage (I don't care which program used what at the moment). The records exceeded 16,000 rows:

Fig 7

The first plot I made was simply plotting the total bytes sent and received for each timestamp. Pandas made it very easy to calculate this:


Fig 8

The result shows a spike in downloads in mid-May (I downloaded a lot of data for a CTF) and a spike in uploads in early June (I uploaded some large files to cloud storage, but something like this could also be exfiltration of files from a corporate network...).
Fig 9
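As a rough illustration of the kind of calculation involved (a sketch, not the code in Fig 8), the totals can be produced with a single `groupby`. The column names `Timestamp`, `Bytes Sent` and `Bytes Received` are assumed here and may not match the headings in a real NetworkUsageView export.

```python
import pandas as pd


def total_bytes_per_timestamp(df):
    """Sum bytes sent and received across all programs for each timestamp.

    Assumes hypothetical columns: 'Timestamp', 'Bytes Sent', 'Bytes Received'.
    """
    out = df.copy()
    out["Timestamp"] = pd.to_datetime(out["Timestamp"])
    return out.groupby("Timestamp")[["Bytes Sent", "Bytes Received"]].sum()


def plot_totals(totals):
    """Draw both columns as lines against time (requires matplotlib)."""
    ax = totals.plot(figsize=(12, 4))
    ax.set_ylabel("Bytes")
    return ax
```

The timestamp format in an export depends on the tool and locale settings, so `pd.to_datetime` may need an explicit `format=` argument.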

I can add another layer of information by plotting data usage per application; here I’ve selected browsers only. This shows that I’ve gradually switched from browser A (which I used for the large download) to browser B (which I used for the large upload):

Fig 10
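One way to get a per-application view like this is to filter to the programs of interest and pivot so that each one becomes its own column. This is a sketch under assumptions: the `App Name`, `Timestamp` and `Bytes Received` column names are hypothetical, as is the list of executables passed in.

```python
import pandas as pd


def bytes_by_app(df, apps, value_col="Bytes Received"):
    """Return a table with one row per timestamp and one column per app.

    `apps` is a list of program names to keep. Column names are
    assumptions for illustration, not a documented export format.
    """
    out = df[df["App Name"].isin(apps)].copy()
    out["Timestamp"] = pd.to_datetime(out["Timestamp"])
    return out.pivot_table(
        index="Timestamp",
        columns="App Name",
        values=value_col,
        aggfunc="sum",
        fill_value=0,  # an app with no traffic at a timestamp counts as 0
    )
```

Calling `.plot()` on the result draws one line per application, which makes a gradual switch from one browser to another easy to see.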

These plots also show a hint of a pattern of life which becomes clearer if we zoom in. Looking at just one week of data reveals my pattern of usage:


Fig 11

This is a good starting point from which to infer my typical working hours (the “day shift”, Monday to Friday). This assessment may be corroborated (or refuted) by looking at other datasets, such as the timings of communication events and when files were accessed. There is some usage in the early hours of Monday, which I think was me leaving some processing/updates running overnight (I’m not sure; April seems like such a long time ago now...), but in a corporate environment, instances of people working outside their usual hours or outside typical office hours could be notable.
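Zooming in on a week is mostly a matter of slicing the time index and, optionally, summing into hourly bins so the daily rhythm stands out. The sketch below assumes a data frame that already has a DatetimeIndex; the function name and the seven-day default are illustrative choices.

```python
import pandas as pd


def hourly_window(totals, start, days=7):
    """Keep rows from `start` up to (not including) `start + days`,
    then sum the values into one-hour bins.

    `totals` must be indexed by a DatetimeIndex.
    """
    start = pd.Timestamp(start)
    end = start + pd.Timedelta(days=days)
    window = totals.loc[(totals.index >= start) & (totals.index < end)]
    return window.resample("60min").sum()
```

Hours with no records inside the window come out as zero, so quiet periods such as nights and weekends show up as flat stretches in the plot.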

The heatmap below shows the downloaded data for the same time period as Fig 11. The start and end times of activity are visible, but it also suggests that I downloaded more in the mornings than in the afternoons:

Fig 12
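A heatmap like this needs the data reshaped into a grid, for example one row per day and one column per hour of the day. The sketch below shows that reshaping step only, and it assumes a Series of byte counts with a DatetimeIndex. The layout (days down, hours across) is a guess at a sensible arrangement, not a description of how Fig 12 was built.

```python
import pandas as pd


def day_hour_grid(series):
    """Pivot a DatetimeIndex-ed Series of byte counts into a
    day-by-hour grid of totals (rows = dates, columns = hours 0-23)."""
    frame = pd.DataFrame({
        "bytes": series.to_numpy(),
        "day": series.index.date,
        "hour": series.index.hour,
    })
    grid = frame.pivot_table(
        index="day", columns="hour", values="bytes",
        aggfunc="sum", fill_value=0,
    )
    # Make sure every hour has a column, even if nothing happened in it.
    return grid.reindex(columns=range(24), fill_value=0)
```

The resulting grid can be handed to a heatmap function such as `seaborn.heatmap` or matplotlib's `imshow` to draw it.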

This data can be considered in conjunction with other artefacts showing program execution (such as prefetch, UserAssist) and file access (such as MRULists, lnk files, jumplists) to contribute to a picture of activity on a system.

Communications

I’m interested in using Sankey diagrams to visualise a summary of communications found within a dataset. My Googling suggests that this can be done with Python & Pandas but I haven’t implemented this yet. In order to get a feel for what this type of visualisation might look like I used a free site to create a Sankey diagram showing some fictional data:

Fig 13

I think something like this would be useful to summarise/triage data. I used the same website to generate a summary for the Binary Hick’s iOS 13.4.1 extraction:

Fig 14

This is a little different to what I was expecting based on my made-up data but I think it’s because there are small numbers of communications & contacts spread across a large number of apps. This makes sense for a dataset where the goal is to generate as many interesting artefacts as possible but might not be what “real” usage looks like. Still, we can instantly get a feel for the distribution of communications between different types (Email vs Calls vs Chats) and distribution between apps (it’s easy to see that Apple Telephony, TextNow and Apple Mail have the most events). I plan to see if I can replicate this in Python and hopefully I can apply it to some real data at a later date.
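As a first step towards doing this in Python, the data preparation can be sketched in Pandas alone. A Sankey diagram is essentially a list of links, each with a source, a target and a weight, so counting events per (type, app) pair gives a table in that shape. The `type` and `app` column names are assumptions for illustration, and the drawing step is left out because it depends on which plotting library is chosen.

```python
import pandas as pd


def sankey_links(events, source_col="type", target_col="app"):
    """Count events for each (source, target) pair.

    Returns a frame with columns 'source', 'target' and 'value',
    one row per link. Column names in `events` are hypothetical.
    """
    links = (
        events.groupby([source_col, target_col])
        .size()
        .reset_index(name="value")
    )
    return links.rename(columns={source_col: "source", target_col: "target"})
```

Each row of the result describes one band in the diagram, with `value` setting its thickness.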

Geo

Geo is also on my to-do list. GeoPandas can be used to plot geo-based data onto maps. I haven’t explored this in detail yet but there is great potential for visualising geo coordinates extracted from forensic data, such as EXIF from media items, searched destinations and routes travelled.

Conclusions

These beginner plots don't really do the potential of Pandas justice but check out the types of plots it can make for more inspiration. Having said that, after only a few video tutorials, some experimentation (and frankly a lot of time on Stack Overflow) I've been able to make visualisations that could add real value to my forensic work. If I can get the plots looking a bit more polished, I think that in certain circumstances they will also make a valuable addition to reports. Given a standardised input (such as CSVs generated by many useful open source tools) these plots could also be at least semi-automated.

