24 Sep 2015

Introduction to AzureML and Machine Learning


Overall time to complete: 20 minutes. Prerequisites: a valid Azure subscription and a valid Microsoft or Organizational account that is able to access the subscription as a Service Administrator.

1. Creating an AzureML namespace

1. Sign in to the Microsoft Azure portal at http://manage.windowsazure.com. (NOTE: AzureML is only available in the current Management Portal at this time.)


2. Ensure that you have enabled Machine Learning as a Preview Feature. To do this, select Machine Learning from the list of preview features on this page: http://azure.microsoft.com/en-gb/services/preview/

3. When you’ve been informed that Machine Learning is available to you (you'll receive an email) it should be visible in the Azure Portal.


4. Click on the +NEW button in the bottom left-hand corner of the screen to create a new AzureML workspace.


5. Provide a valid name for the AzureML workspace and a valid workspace owner. This will usually default to the Microsoft Account you used to access the Azure portal. Note that during the preview AzureML is only available in South Central US. Enter the name of a storage account that already exists in South Central US, or elect to create a new account, with a unique name, in this region.


6. Click on the CREATE AN ML WORKSPACE link to create the new AzureML workspace. The new AzureML workspace will be available in about 2-3 minutes.

7. You should receive an email as the workspace owner welcoming you to AzureML. The new workspace will be listed in the portal.


2. Upload a new dataset

1. Click on the “Open in Studio” button on the bottom taskbar of your AzureML namespace. This will open a second browser tab containing ML Studio.


2. Navigate to the new tab and at the bottom left hand corner of the screen you should see the now familiar "+NEW" button. Click this button.


3. Select DATASET -> "From Local File" from the new dialogue.


4. A new dialogue will be presented to you. From this, select "Choose File" and navigate to the course root directory -> assets -> dataml.csv.


5. In the "ENTER A NAME FOR THE NEW DATASET" textbox enter Device Data. Leave everything else as you see it and click the tick mark in the right hand corner of the dialogue.

3. Create an experiment

1. Click on the +NEW button in the bottom left corner of the screen and select Experiment -> Blank Experiment.


2. The menu shows all of the top level options in AzureML.


3. Open the "Saved Datasets" node and drag the newly uploaded Device Data on to the canvas on the right-hand side.

4. Navigate to "Data Transformation" -> "Sample and Split", drag the Split task to the canvas, and position it underneath the Device Data dataset.


5. Drag from the output port of the Device Data dataset to the input port of the Split task to connect the items.

6. Click on the Split task and, in the Splitting Mode dropdown in the right-hand properties pane, select regular expression. Then enter "type" ^temperature into the regular expression textbox.
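Conceptually, the regular-expression splitting mode sends rows whose column value matches the pattern to the first output and the rest to the second. A minimal Python sketch of that behaviour (the column names come from the lab; the sample values are made up):

```python
import re

# Hypothetical rows standing in for the Device Data dataset;
# the column names come from the lab, the values are invented.
rows = [
    {"type": "temperature", "reading": 24.5, "roomno": 0},
    {"type": "humidity", "reading": 55.0, "roomno": 0},
    {"type": "temperature", "reading": 19.8, "roomno": 1},
]

def split_by_regex(rows, column, pattern):
    """Mimic the Split task's regular-expression mode: matching rows go
    to the first output, everything else to the second."""
    regex = re.compile(pattern)
    first = [r for r in rows if regex.search(str(r[column]))]
    second = [r for r in rows if not regex.search(str(r[column]))]
    return first, second

temperature_rows, other_rows = split_by_regex(rows, "type", "^temperature")
print(len(temperature_rows), len(other_rows))  # 2 1
```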

7. We'll now select the Project Columns task from "Data Transformation" -> Manipulation. This module allows us to select which columns of data we want to include in or exclude from the model. After dragging it onto the canvas, connect the Split task's left output port to the Project Columns input port.

8. Click on the Project Columns task and press the Launch column selector button in the properties window. In the properties textbox we'll enter reading and roomno (two of the columns in the dataset). Click the check mark in the lower right corner to accept the changes and close the column selector.
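Project Columns simply keeps (or drops) named columns. A rough Python equivalent, using hypothetical rows in place of the real dataset:

```python
# Hypothetical rows standing in for the Device Data dataset (values made up).
rows = [
    {"type": "temperature", "reading": 24.5, "roomno": 0},
    {"type": "temperature", "reading": 19.8, "roomno": 1},
]

def project_columns(rows, include=None, exclude=None):
    """Keep only the named columns (or drop the excluded ones),
    like the Project Columns module."""
    if include is not None:
        return [{k: r[k] for k in include} for r in rows]
    return [{k: v for k, v in r.items() if k not in (exclude or [])} for r in rows]

print(project_columns(rows, include=["reading", "roomno"])[0])
# {'reading': 24.5, 'roomno': 0}
```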


9. From "Data Transformation" -> Manipulation drag the Metadata Editor task onto the canvas. Connect the Project Columns task to the Metadata Editor task.

10. Click on the Metadata Editor task in the design pane. Notice the Quick Help in the lower right corner of the Azure ML Studio. When you click the (more help) link, a new browser tab opens with a detailed description of the task and options. After reviewing the description and options, return to the Azure ML Studio.

11. To edit the properties of the Metadata Editor, select the Launch column selector and enter reading and roomno.

12. Update each of the following properties:

– Data Type: Floating Point

– Categorical: Make non-categorical

– Fields: Features


13. Drag a second Split task to the design pane and connect it to the Metadata Editor. This time leave the splitting mode at the default "Split Rows", but change the value in the "Fraction of rows in the first output" textbox from 0.5 to 0.8.
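The "Split Rows" mode simply sends a random fraction of the rows to the first output (here 80%, typically used for training) and the remainder to the second. A hedged sketch of the idea, with invented data:

```python
import random

# Hypothetical (reading, roomno) rows standing in for the dataset.
data = [(20.0 + i * 0.1, i % 4) for i in range(100)]

def split_rows(rows, fraction, seed=42):
    """Mimic 'Split Rows' mode: a random 'fraction' of the rows goes to
    the first output, the remainder to the second."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]

train, test = split_rows(data, 0.8)
print(len(train), len(test))  # 80 20
```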

14. By now we should have the following on our canvas, configured correctly.


15. We're now ready to consider our machine learning algorithm so we'll drag three tasks onto the screen. Position and connect them as in the diagram below and in the correct list order.

– "Machine Learning" -> "Initialize Model" -> Clustering -> K-Means Clustering

– "Machine Learning" -> Train -> Train Clustering Model

– "Machine Learning" -> Score -> Assign to Clusters


16. Click on the K-Means Clustering task and update the properties page on the right-hand side. Change the value for Number of Centroids to 4 but keep all other values the same.


17. Click on the Train Clustering Model task and launch the column selector to select reading and roomno. Repeat for the Assign to Clusters task.
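For a sense of what the K-Means task is doing with its four centroids, here is a minimal, self-contained k-means sketch in Python. The (reading, roomno) points are invented, and Azure ML's implementation differs (initialization, scaling, stopping criteria), so treat this purely as an illustration of the algorithm:

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """A minimal k-means sketch: repeatedly assign each point to its
    nearest centroid, then move each centroid to the mean of its members."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]

    def nearest(p):
        return min(range(k), key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))

    for _ in range(iterations):
        members = [[] for _ in range(k)]
        for p in points:
            members[nearest(p)].append(p)
        for i, group in enumerate(members):
            if group:  # leave an empty cluster's centroid where it is
                centroids[i] = [sum(dim) / len(group) for dim in zip(*group)]
    return centroids, [nearest(p) for p in points]

# Hypothetical (reading, roomno) points: most readings near 21, a few skewed.
points = [(21.0 + offset, room) for room in range(4) for offset in (-0.2, 0.0, 0.3)]
points += [(35.0, 2), (5.0, 1)]

centroids, assignments = kmeans(points, 4)
print(len(centroids), len(assignments))  # 4 14
```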

18. Now that we've completed these steps, we can drag another "Data Transformation" -> Manipulation -> "Metadata Editor" task to the canvas and connect the Assign to Clusters output port to the new Metadata Editor input port. In the properties page select Launch column selector, but this time choose a new column called "Assignments", and from the Categorical dropdown select "Make Categorical".

19. We can now rename the experiment to something sensible. This is done by clicking on the title at the top of the experiment and typing in a new name. In this case we'll call the experiment "Device Data".


20. We'll now click on the Save button on the bottom toolbar followed by the Run button.


21. After running the experiment you'll see all of the tasks with a green tick in the corner.


22. Right-click at the bottom of "Assign to Clusters" (on the small circle) and select Visualize to see a graph of the principal components.


23. Right-click at the bottom of the final "Metadata Editor" task and select Visualize to see a table of all of the temperature device data and the cluster assignment (0-3) of each data point.


N.B. The algorithm has been able to spot a number of skewed results, which represent below-average or above-average temperature readings, in clusters 1-3.

4. Enable access to a web service

1. Now that the experiment is complete we can open up our model to the rest of the world! To do so, we'll click on the "Prepare Web Service" button on the bottom toolbar.


2. Clicking the button will do the following:

– Introduce a task at the top of the workflow called "Web service input"

– Introduce a task at the bottom of the workflow called "Web service output"

– Introduce a new "switch" which toggles between the "web services view" and the "training experiment view"

3. We should now be defaulted to the web services view


4. We'll now move the web service input connection from the top of the workflow directly to the "Assign to Clusters" task.


5. Run the experiment again. Note that the "Prepare web service" button on the bottom toolbar has now changed to "Publish web service".

6. When the confirmation dialogue appears, click yes.


7. When this operation has completed another browser tab will open containing all of the details about the new web service and how to access it


5. Test the web service

1. On the results page you should see a small link which says "test" next to the REQUEST/RESPONSE "API help page" link. Click this link to test the service.

2. Enter a roomno and reading in the dialogue e.g. 0 and 24.5


3. The task bar at the bottom should show the result


4. The result is returned as JSON and should look something like ["0", "24.5", "0"], where the first two elements of the array are roomno and reading and the final element is the cluster assignment, 0 (remember clusters 0-3 from the previous step), which tells us that this data point falls within the average set of data points.

5. For completeness, you can click on the "API help page" link from the page in step (1) and navigate to the bottom of the page. Here you will see code in C#, R and Python which can be used to invoke the web service programmatically, for example from frameworks such as Storm or from virtual machines in Azure.
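As a rough illustration of what that generated code does, the sketch below builds a request/response payload for a classic Azure ML web service in Python. The URL and API key are placeholders, and the input name ("input1") and column order are assumptions you should check against your own API help page:

```python
import json
import urllib.request

# Placeholders: copy the real values from your web service's API help page.
API_URL = "https://ussouthcentral.services.azureml.net/workspaces/<workspace>/services/<service>/execute?api-version=2.0"
API_KEY = "<your api key>"

def build_request(roomno, reading):
    """Build the request/response (RRS) payload; the 'input1' name and
    column order here are assumptions to verify against the help page."""
    return {
        "Inputs": {
            "input1": {
                "ColumnNames": ["roomno", "reading"],
                "Values": [[str(roomno), str(reading)]],
            }
        },
        "GlobalParameters": {},
    }

def score(roomno, reading):
    """POST the payload to the service and return the decoded response."""
    body = json.dumps(build_request(roomno, reading)).encode("utf-8")
    req = urllib.request.Request(API_URL, data=body, headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer " + API_KEY,
    })
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# With real credentials, score(0, 24.5) would return the scored row,
# including its cluster assignment, as in the test step above.
```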

8 Sep 2015

Azure Stream Analytics


Create the Streaming Analytics Job

1. Navigate to the Microsoft Azure management interface https://manage.windowsazure.com
(NOTE: Azure Stream Analytics is only configurable in the current Management Portal at this time.)

2. Click "+NEW" in the bottom left hand corner of the screen

3. Select Data Services -> Stream Analytics, click Quick Create.


4. Configure Stream Analytics.

– Enter a job name and location (limited choice due to this being in preview, select any location)

– Select Create new storage account.

– Enter a new storage account name.

5. Click Create stream analytics job.


6. Once creation has finished, navigate to the job.


7. Select the Inputs tab.


8. Click "+Add Input" at the bottom middle of the screen.


9. Choose the default "Data stream" and click next.


10. Choose the "Event Hub" option.


11. Enter the connection information for the event hub, click next.

– Input alias is MyEventHubStream (the name is important as it is referenced in the query)

– Select Use Event Hub from Current Subscription


12. Specify that the data serialization format is JSON and the encoding is UTF-8, click Finish


13. The connection to the storage account will be tested; this will take a moment to complete.


14. A new input will be created.


15. Navigate to the Output tab


16. Click “Add Output" at the bottom middle of the screen


17. Choose SQL Database


18. Enter the connection information, click Finish.

– Choose Use SQL Database from Existing Subscription

– Select the database created in Hands on Lab 1

– Enter the user name and password used when the database was created

– Enter the table name AvgReadings


19. A new output will be created.


20. On the Query tab enter the following and select Save at the bottom.

SELECT DateAdd(minute, -1, System.TimeStamp) AS WinStartTime, System.TimeStamp AS WinEndTime, Type = 'Temperature', RoomNumber, Avg(Temperature) AS AvgReading, Count(*) AS EventCount
FROM MyEventHubStream
WHERE Temperature IS NOT NULL
GROUP BY TumblingWindow(minute, 1), RoomNumber, Type

SELECT DateAdd(minute, -1, System.TimeStamp) AS WinStartTime, System.TimeStamp AS WinEndTime, Type = 'Humidity', RoomNumber, Avg(Humidity) AS AvgReading, Count(*) AS EventCount
FROM MyEventHubStream
WHERE Humidity IS NOT NULL
GROUP BY TumblingWindow(minute, 1), RoomNumber, Type

SELECT DateAdd(minute, -1, System.TimeStamp) AS WinStartTime, System.TimeStamp AS WinEndTime, Type = 'Energy', RoomNumber, Avg(Kwh) AS AvgReading, Count(*) AS EventCount
FROM MyEventHubStream
GROUP BY TumblingWindow(minute, 1), RoomNumber, Type

SELECT DateAdd(minute, -1, System.TimeStamp) AS WinStartTime, System.TimeStamp AS WinEndTime, Type = 'Light', RoomNumber, Avg(Lumens) AS AvgReading, Count(*) AS EventCount
FROM MyEventHubStream
WHERE Lumens IS NOT NULL
GROUP BY TumblingWindow(minute, 1), RoomNumber, Type
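To see what `GROUP BY TumblingWindow(minute, 1), RoomNumber` computes, here is a hedged Python sketch that buckets hypothetical events into non-overlapping one-minute windows per room and averages a field, skipping nulls as the WHERE clause does. The event shape and timestamps are invented for illustration:

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical events of the shape the query expects (values made up).
events = [
    {"ts": datetime(2015, 9, 8, 10, 0, 5),  "RoomNumber": 1, "Temperature": 20.0},
    {"ts": datetime(2015, 9, 8, 10, 0, 40), "RoomNumber": 1, "Temperature": 22.0},
    {"ts": datetime(2015, 9, 8, 10, 1, 10), "RoomNumber": 1, "Temperature": 24.0},
]

def tumbling_window_avg(events, field):
    """Group events into non-overlapping one-minute windows per room and
    average 'field', like GROUP BY TumblingWindow(minute, 1), RoomNumber."""
    buckets = defaultdict(list)
    for e in events:
        if e.get(field) is None:  # the WHERE ... IS NOT NULL filter
            continue
        window = e["ts"].replace(second=0, microsecond=0)  # window start
        buckets[(window, e["RoomNumber"])].append(e[field])
    # (window start, room) -> (average reading, event count)
    return {key: (sum(vals) / len(vals), len(vals)) for key, vals in buckets.items()}

print(tumbling_window_avg(events, "Temperature"))
```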

21. On the Dashboard tab, start the job by pressing the "Start" button on the Azure menu at the bottom middle of the page


22. It will take a few moments to start; a minute or so later, data should appear in the database table. Use Microsoft SQL Server Management Studio to view the results.

1 Jul 2015

HDInsight Cluster


Azure HDInsight deploys and provisions Apache Hadoop clusters in the cloud, providing a software framework designed to manage, analyze, and report on big data.

The Hadoop core provides reliable data storage with the Hadoop Distributed File System (HDFS), and a simple MapReduce programming model to process and analyze, in parallel, the data stored in this distributed system.
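The MapReduce model itself can be sketched in a few lines: a map function emits key/value pairs, the framework shuffles them by key, and a reduce function aggregates each group. A toy word count in Python (not Hadoop code, just the programming model):

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in the line."""
    for word in line.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    """Reduce: sum the values emitted for one key."""
    return (word, sum(counts))

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Shuffle: collect all mapped pairs and group them by key.
shuffled = defaultdict(list)
for line in lines:
    for word, count in map_phase(line):
        shuffled[word].append(count)

result = dict(reduce_phase(w, c) for w, c in shuffled.items())
print(result["the"], result["fox"])  # 3 2
```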

1.2 Create the HDInsight cluster

1. Navigate to the Microsoft Azure management interface https://manage.windowsazure.com (NOTE: HDInsight clusters are only configurable in current Management Portal at this time)

2. Ensure you are on the correct subscription by locating the subscriptions button in the upper right hand corner of the page.


3. Navigate to "HDInsight" tab on the left hand menu.


4. Click "+NEW" in the bottom left hand corner of the page.


5. This will open up the data services menu and highlight HDInsight (note the path Data Service -> HDInsight).


6. Click on the "Custom Create" button.


7. A New HDInsight Cluster creation wizard will open.

– Enter a name for the cluster; if available the interface will indicate with a green tick.

– Note the options for Cluster Type and HDInsight Version. Don't change any of the default settings; these will be Hadoop and version 3.1.

Click the "Next" arrow in the lower right corner.


8. Change Data Nodes to 1 to create a one-node cluster. Set the Location to the same location specified when you created the storage account in HOL1. Leave the other options at their default values, since these are the smallest sizes available, and click the "Next" arrow in the lower right corner.


9. An administrative user is created for access via the web browser in later exercises.

– Enter a username and password. You may want to store the username and password in a text document on the desktop for use later in the lab.

– Notice the option to enter the Hive/Oozie Metastore. Many production clusters will use a centralized Azure SQL Database to store Hive metadata, creating an agile method for bursting clusters on demand and reusing metadata across clusters over time. Our labs will not use a centralized metastore; do not check the option.

– Also notice the ability to enable Remote Desktop. Do not check this option; we will configure it later in the lab.

Click "Next".


10. Note the options under STORAGE ACCOUNT to use an existing storage account or create a new account. We will use the storage account created in Hands-on Lab 1.

– Select Use Existing Storage.

– Change the ACCOUNT NAME drop-down to the name of the Storage Account that was created in HOL1.

– Select the container data we created in HOL1.

– ADDITIONAL STORAGE ACCOUNTS specifies the number of additional storage accounts for use with the cluster. Our labs only require the single storage account, accept the default.

Click "Next".


11. The final view, "Script Actions", allows you to add a PowerShell script that will run during the provisioning process. This is useful for installing additional software or features on the cluster. Microsoft has released several examples, such as R, Solr, and Spark. Our labs will not require additional software or features.

Click the check mark in the lower right-hand corner to finish and create the cluster.


12. The provisioning process completes in 10-20 minutes. The HDInsight cluster and the cluster status will be visible in the list of available clusters.


2. Enable Remote Desktop

Some administrators may want to manage the cluster from the head node. HDInsight supports RDP to the head node, which is enabled through the portal, through PowerShell or the command line. The following steps will enable RDP through the management portal.

1. Navigate to the Microsoft Azure management interface at https://manage.windowsazure.com. Select HDInsight from the left menu and click on the cluster you just created. The quick start screen will be presented. Here you can select the Dashboard from the top menu to view general information on the cluster.


2. Select configuration from the top menu.


3. Select Enable Remote from the bottom of the page.


4. RDP will require a new username unique to the cluster.

– Enter a new username and password. The username must differ from the admin user chosen at cluster creation.

– Select an expiration date for RDP. NOTE: The expiration date must be in the future and no more than a week from the present. The expiration time of day is assumed by default to be midnight of the specified date.

Click Ok to configure RDP; this will take 2-3 minutes to complete.


3. Connecting to the HDInsight cluster

1. Once configuration is complete you can initiate the RDP connection using the Connect button located at the bottom of the screen. An open/save dialog will appear at the bottom of the page. Select Open.


2. A new dialog will open. Click Connect.


3. Enter your credentials; use the credentials specified as part of the Remote Desktop configuration in the previous section when connecting to the instance.


4. Choose yes to accept the certificate.


5. You will then be presented with a remote desktop connection to the head node of the cluster.


3.1 Using the Hadoop File System

1. Once logged in, open the Hadoop Command Line. The link to this can be found on the desktop of the head node.


2. Execute the command

hadoop fs -ls /

3. The following set of files will be displayed


4. A number of directories will have been created as part of the cluster provisioning process. These directories should remain untouched so the cluster can operate normally.

– HdiSamples

• Contains a set of sample data

– apps

• Contains 3rd party apps that are required for the normal operation of the cluster

– example

• Additional examples and data

– hive

• Where the hive tables are stored

– mapred

• Contains the history of map reduce jobs

– user

• A set of libraries associated with the hadoop user

– yarn

• Holds the yarn application history data