{"id":707,"date":"2020-01-16T18:15:39","date_gmt":"2020-01-17T02:15:39","guid":{"rendered":"http:\/\/blog.nillsf.com\/?p=707"},"modified":"2020-01-16T18:16:47","modified_gmt":"2020-01-17T02:16:47","slug":"analyse-storage-account-logs-using-python-in-azure-notebooks","status":"publish","type":"post","link":"https:\/\/blog.nillsf.com\/index.php\/2020\/01\/16\/analyse-storage-account-logs-using-python-in-azure-notebooks\/","title":{"rendered":"Analyse Storage Account logs using Python in Azure Notebooks"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">Azure Storage can provide you with detailed log information about all transactions happening against your storage account. There are default metrics that are gathered and shown through Azure Monitor. Additionally, you can configure logging on the storage account, which gives you log information on a per-request basis.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Analyzing these logs can be a bit difficult and painful. They are spread out over multiple files, and each file is plain semicolon-delimited text. In this blog post I&#8217;ll explain how you can analyse these logs using Python and Pandas. I&#8217;m very new to this, so my solution might not be the best.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Let&#8217;s have a look!<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Setting up storage logs<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">For the purpose of this demo, I&#8217;ll create a new storage account. 
<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"780\" height=\"694\" src=\"\/wp-content\/uploads\/2020\/01\/image-40.png\" alt=\"\" class=\"wp-image-708\" srcset=\"https:\/\/nillsfblog.blob.core.windows.net\/media\/2020\/01\/image-40.png 780w, https:\/\/nillsfblog.blob.core.windows.net\/media\/2020\/01\/image-40-300x267.png 300w, https:\/\/nillsfblog.blob.core.windows.net\/media\/2020\/01\/image-40-768x683.png 768w\" sizes=\"auto, (max-width: 780px) 100vw, 780px\" \/><figcaption>Creating a new storage account.<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Next up, we&#8217;ll go into the classic diagnostics settings and enable storage logging.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"752\" height=\"616\" src=\"\/wp-content\/uploads\/2020\/01\/image-41.png\" alt=\"\" class=\"wp-image-709\" srcset=\"https:\/\/nillsfblog.blob.core.windows.net\/media\/2020\/01\/image-41.png 752w, https:\/\/nillsfblog.blob.core.windows.net\/media\/2020\/01\/image-41-300x246.png 300w\" sizes=\"auto, (max-width: 752px) 100vw, 752px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">This will create a new storage container called <code>$logs<\/code> that will contain all the logs. These logs contain a wealth of info, like:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Time<\/li><li>API call<\/li><li>HTTP status code<\/li><li>Request IP<\/li><li>User-Agent-header<\/li><li>&#8230;<\/li><\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">If you&#8217;re doing some troubleshooting on storage, these logs might be very useful for you.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">At this point, you&#8217;ll want to start creating some data. 
I used the Azure Storage Explorer to upload some sample data, which generated some sample logs for me to use.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Consuming storage logs in Azure Notebooks<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">If you&#8217;ve never dealt with Azure Notebooks, let me take a minute to explain what it is. Azure Notebooks is a free hosted service to develop and run Jupyter notebooks in the cloud with no installation. It allows you to quickly run Python- or R-based notebooks without having to provision any infrastructure.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">I personally have played around with a couple of notebooks as part of my learning for <a href=\"https:\/\/blog.nillsf.com\/index.php\/2019\/10\/26\/exploring-the-dp-100-certification-path\/\">DP-100<\/a>, but I never started from scratch, which is what I did for this one.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">As a goal for my log analysis here, I wanted to get a count per minute of each &#8216;User-Agent-header&#8217; connecting to my Azure storage account. Let me walk you through this:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Start by going to <a href=\"https:\/\/notebooks.azure.com\/\">Azure Notebooks<\/a> and signing in. 
If this is your first time signing in, you&#8217;ll need to provide a name for your profile.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"661\" height=\"278\" src=\"\/wp-content\/uploads\/2020\/01\/image-44.png\" alt=\"\" class=\"wp-image-712\" srcset=\"https:\/\/nillsfblog.blob.core.windows.net\/media\/2020\/01\/image-44.png 661w, https:\/\/nillsfblog.blob.core.windows.net\/media\/2020\/01\/image-44-300x126.png 300w\" sizes=\"auto, (max-width: 661px) 100vw, 661px\" \/><figcaption>If this is the first time logging in, create a user ID.<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Once that is created, head on over to &#8216;My Projects&#8217; and create a new project.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"894\" height=\"260\" src=\"\/wp-content\/uploads\/2020\/01\/image-45.png\" alt=\"\" class=\"wp-image-713\" srcset=\"https:\/\/nillsfblog.blob.core.windows.net\/media\/2020\/01\/image-45.png 894w, https:\/\/nillsfblog.blob.core.windows.net\/media\/2020\/01\/image-45-300x87.png 300w, https:\/\/nillsfblog.blob.core.windows.net\/media\/2020\/01\/image-45-768x223.png 768w\" sizes=\"auto, (max-width: 894px) 100vw, 894px\" \/><figcaption>Create a new project.<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Once you have the project open, you&#8217;ll also want to create a Notebook. 
In our case, we&#8217;ll create a Python 3.6 notebook.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"765\" height=\"468\" src=\"\/wp-content\/uploads\/2020\/01\/image-46.png\" alt=\"\" class=\"wp-image-714\" srcset=\"https:\/\/nillsfblog.blob.core.windows.net\/media\/2020\/01\/image-46.png 765w, https:\/\/nillsfblog.blob.core.windows.net\/media\/2020\/01\/image-46-300x184.png 300w\" sizes=\"auto, (max-width: 765px) 100vw, 765px\" \/><figcaption>Create a new Notebook<\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"685\" height=\"533\" src=\"\/wp-content\/uploads\/2020\/01\/image-47.png\" alt=\"\" class=\"wp-image-715\" srcset=\"https:\/\/nillsfblog.blob.core.windows.net\/media\/2020\/01\/image-47.png 685w, https:\/\/nillsfblog.blob.core.windows.net\/media\/2020\/01\/image-47-300x233.png 300w\" sizes=\"auto, (max-width: 685px) 100vw, 685px\" \/><figcaption>Give it a name and a Python version<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Click on the notebook name, and you&#8217;ll be taken into your Notebook. A Notebook can have multiple cells that share memory state but can be executed independently. If you&#8217;ve never touched a notebook before, why not type <code>print('Hello World')<\/code> to run your first Hello World! 
Hit either the graphical run button or <code>CTRL+Enter<\/code> to run the cell.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"732\" height=\"334\" src=\"\/wp-content\/uploads\/2020\/01\/image-48.png\" alt=\"\" class=\"wp-image-716\" srcset=\"https:\/\/nillsfblog.blob.core.windows.net\/media\/2020\/01\/image-48.png 732w, https:\/\/nillsfblog.blob.core.windows.net\/media\/2020\/01\/image-48-300x137.png 300w\" sizes=\"auto, (max-width: 732px) 100vw, 732px\" \/><figcaption>Running Hello World in Jupyter<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Now comes the fun part, the actual code. Walking through every line would be tedious, so let me share the essential code and steps that I wrote to get this working:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from azure.storage.blob import BlockBlobService\nimport csv\nfrom io import StringIO\nimport pandas as pd\nimport datetime<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The snippet above imports the necessary libraries into our working set.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>block_blob_service = BlockBlobService(account_name=\"nfanalytics\", account_key=\"xxx\")\nblobs = block_blob_service.list_blobs(\"$logs\")\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The snippet above lists all blobs in the <code>$logs<\/code> container.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># Rows of [time, API, user-agent]; note that this name shadows Python's built-in list\nlist = []\nfor blob in blobs:\n    # Download the log blob as text\n    blobcontent = block_blob_service.get_blob_to_text(blob_name=blob.name,container_name=\"$logs\").content\n    cleanline = StringIO(blobcontent)\n    # The log files are semicolon-delimited\n    reader = csv.reader(cleanline, delimiter=';')\n    for line in reader:\n        # Field 1 = date, field 2 = API, field 27 = user-agent\n        list.append([line[1],line[2],line[27]])<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The snippet 
above opens each log file and stores the data I am interested in (date, API and user-agent) in a Python list.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>df = pd.DataFrame(list, columns = ['Time', 'API', 'Source']) <\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The snippet above loads this list into a Pandas DataFrame. Pandas is a Python library that makes it easier to work with collections of data and do aggregations, summaries and much more.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Let&#8217;s take a brief pause here and execute all our cells, and think about what we&#8217;ve done and what we still need to do.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"630\" height=\"569\" src=\"\/wp-content\/uploads\/2020\/01\/image-49.png\" alt=\"\" class=\"wp-image-717\" srcset=\"https:\/\/nillsfblog.blob.core.windows.net\/media\/2020\/01\/image-49.png 630w, https:\/\/nillsfblog.blob.core.windows.net\/media\/2020\/01\/image-49-300x271.png 300w\" sizes=\"auto, (max-width: 630px) 100vw, 630px\" \/><figcaption>The code we&#8217;ve written and executed thus far.<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Thus far we have:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>downloaded our logs from blob storage<\/li><li>loaded them into a list<\/li><li>transformed that list into a Pandas DataFrame.<\/li><\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">What we now need to do is find a way to transform this data into a graph that shows us, per minute, which user-agent is being used most often. Let&#8217;s explore.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Let&#8217;s start our exploration by having a look at our dataframe. 
This can be done via <code>df.head()<\/code>.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"627\" height=\"302\" src=\"\/wp-content\/uploads\/2020\/01\/image-50.png\" alt=\"\" class=\"wp-image-718\" srcset=\"https:\/\/nillsfblog.blob.core.windows.net\/media\/2020\/01\/image-50.png 627w, https:\/\/nillsfblog.blob.core.windows.net\/media\/2020\/01\/image-50-300x144.png 300w\" sizes=\"auto, (max-width: 627px) 100vw, 627px\" \/><figcaption>Exploring the dataframe<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Now, we&#8217;ll take a couple of steps to clean up the Time info. We&#8217;ll first convert the Time column into actual datetime values and then use it as the index for our time series:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>df['Time'] =  pd.to_datetime(df['Time'])\ndfi = df.set_index('Time')<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Now comes a bit of a weird step. We&#8217;ll use the <code>get_dummies<\/code> function to transform the Source column into one indicator column per distinct source, marking the rows where that source appears. This will come in handy once we aggregate our data per minute and sum the occurrences (which we&#8217;ll also do). 
If you want to understand what is going on in this step, you can peek at the data again using the <code>head()<\/code> function.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>dfidum = pd.get_dummies(dfi, columns=['Source'])\ndfimin = dfidum.resample('60s').sum()<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"611\" height=\"451\" src=\"\/wp-content\/uploads\/2020\/01\/image-51.png\" alt=\"\" class=\"wp-image-719\" srcset=\"https:\/\/nillsfblog.blob.core.windows.net\/media\/2020\/01\/image-51.png 611w, https:\/\/nillsfblog.blob.core.windows.net\/media\/2020\/01\/image-51-300x221.png 300w\" sizes=\"auto, (max-width: 611px) 100vw, 611px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">And now we can do our actual analysis. There are a couple of things we can do. An interesting numerical analysis comes from the <code>describe()<\/code> function, called as <code>dfimin.describe()<\/code>. It computes summary statistics for every numeric column.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"622\" height=\"434\" src=\"\/wp-content\/uploads\/2020\/01\/image-52.png\" alt=\"\" class=\"wp-image-720\" srcset=\"https:\/\/nillsfblog.blob.core.windows.net\/media\/2020\/01\/image-52.png 622w, https:\/\/nillsfblog.blob.core.windows.net\/media\/2020\/01\/image-52-300x209.png 300w\" sizes=\"auto, (max-width: 622px) 100vw, 622px\" \/><figcaption>Using describe() gives us interesting analysis.<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">We can also create a plot that shows the time series in a nice graph. 
This can be done via <code>dfimin.plot()<\/code>.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"638\" height=\"270\" src=\"\/wp-content\/uploads\/2020\/01\/image-53.png\" alt=\"\" class=\"wp-image-721\" srcset=\"https:\/\/nillsfblog.blob.core.windows.net\/media\/2020\/01\/image-53.png 638w, https:\/\/nillsfblog.blob.core.windows.net\/media\/2020\/01\/image-53-300x127.png 300w\" sizes=\"auto, (max-width: 638px) 100vw, 638px\" \/><figcaption>Creating a graph using the plot() function.<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">In this blog post we looked into how we can do some rudimentary analysis of Azure storage logs using a Jupyter notebook. We were able to analyse the most popular user-agent strings connecting to Azure storage, which in my case is the Python SDK, funnily enough.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This is very basic analysis. I&#8217;m not a data analyst, but this analysis took me about 2 hours to build, so not too bad I would say. And best of all, it was all free!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Azure Storage can provide you with detailed log information about all transactions happening against your storage account. There are default metrics that are gathered and shown through Azure Monitor. Additionally, you can configure logging on the storage account, which gives you log information on a per-request basis. 
Analyzing these logs can be a bit [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":721,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[2,4],"tags":[8,75,74,76,67],"class_list":["post-707","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-azure","category-management","tag-azure","tag-blob-storage","tag-data-engineering","tag-monitoring","tag-storage"],"jetpack_featured_media_url":"https:\/\/nillsfblog.blob.core.windows.net\/media\/2020\/01\/image-53.png","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/blog.nillsf.com\/index.php\/wp-json\/wp\/v2\/posts\/707","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.nillsf.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.nillsf.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.nillsf.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.nillsf.com\/index.php\/wp-json\/wp\/v2\/comments?post=707"}],"version-history":[{"count":2,"href":"https:\/\/blog.nillsf.com\/index.php\/wp-json\/wp\/v2\/posts\/707\/revisions"}],"predecessor-version":[{"id":723,"href":"https:\/\/blog.nillsf.com\/index.php\/wp-json\/wp\/v2\/posts\/707\/revisions\/723"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.nillsf.com\/index.php\/wp-json\/wp\/v2\/media\/721"}],"wp:attachment":[{"href":"https:\/\/blog.nillsf.com\/index.php\/wp-json\/wp\/v2\/media?parent=707"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.nillsf.com\/index.php\/wp-json\/wp\/v2\/categories?post=707"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.nillsf.com\/index.php\/wp-json\/wp\/v2\/tags?post=707"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true
}]}}