
Working with Datastores


Although it's fairly common for data scientists to work with data on their local file system, in an
enterprise environment it can be more effective to store the data in a central location where
multiple data scientists can access it. In this lab, you'll store data in the cloud, and use an Azure
Machine Learning datastore to access it.

Important: The code in this notebook assumes that you have completed the first
two tasks in Lab 4A. If you have not done so, go and do them now!

Connect to Your Workspace


To access your datastore using the Azure Machine Learning SDK, you need to connect to your
workspace.

Note: If the authenticated session with your Azure subscription has expired since
you completed the previous exercise, you'll be prompted to reauthenticate.
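
If you need more control over how that sign-in happens (for example, to target a specific Azure AD tenant), you can pass an explicit authentication object to the workspace constructor. A minimal sketch, with a placeholder tenant ID:

from azureml.core.authentication import InteractiveLoginAuthentication
from azureml.core import Workspace

# Force an interactive sign-in against a specific tenant ('<your-tenant-id>' is a placeholder)
interactive_auth = InteractiveLoginAuthentication(tenant_id='<your-tenant-id>')
ws = Workspace.from_config(auth=interactive_auth)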

In [1]: import azureml.core
from azureml.core import Workspace

# Load the workspace from the saved config file
ws = Workspace.from_config()
print('Ready to use Azure ML {} to work with {}'.format(azureml.core.VERSION, ws.name))

Ready to use Azure ML 1.8.0 to work with workspace200623

View Datastores in the Workspace


The workspace contains several datastores, including the aml_data datastore you created in the
previous task (labdocs/Lab04A.md).

Run the following code to retrieve the default datastore, and then list all of the datastores indicating
which is the default.


In [2]: from azureml.core import Datastore

# Get the default datastore
default_ds = ws.get_default_datastore()

# Enumerate all datastores, indicating which is the default
for ds_name in ws.datastores:
    print(ds_name, "- Default =", ds_name == default_ds.name)

aml_data - Default = False
azureml_globaldatasets - Default = False
workspaceblobstore - Default = True
workspacefilestore - Default = False
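
Datastores like aml_data were registered for you (in this case through the studio UI in the previous task), but the SDK can register them too. A sketch of registering a blob container as a datastore, with placeholder account details:

from azureml.core import Datastore

# All names and the key below are placeholders - the lab's aml_data datastore
# was created through the studio UI rather than with this call
blob_ds = Datastore.register_azure_blob_container(workspace=ws,
                                                  datastore_name='aml_data',
                                                  container_name='<container-name>',
                                                  account_name='<storage-account-name>',
                                                  account_key='<storage-account-key>')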

Get a Datastore to Work With


You want to work with the aml_data datastore, so you need to get it by name:

In [3]: aml_datastore = Datastore.get(ws, 'aml_data')
print(aml_datastore.name, ":", aml_datastore.datastore_type + " (" + aml_datastore.account_name + ")")

aml_data : AzureBlob (200623lab4a)

Set the Default Datastore


You are primarily going to work with the aml_data datastore in this course, so for convenience
you can set it to be the default datastore:

In [4]: ws.set_default_datastore('aml_data')
default_ds = ws.get_default_datastore()
print(default_ds.name)

aml_data

Upload Data to a Datastore


Now that you have identified the datastore you want to work with, you can upload files from your
local file system so that they will be accessible to experiments running in the workspace,
regardless of where the experiment script is actually being run.


In [5]: default_ds.upload_files(files=['./data/diabetes.csv', './data/diabetes2.csv'], # Upload the diabetes csv files in /data
                        target_path='diabetes-data/', # Put it in a folder path in the datastore
                        overwrite=True, # Replace existing files of the same name
                        show_progress=True)

Uploading an estimated of 2 files
Uploading ./data/diabetes.csv
Uploading ./data/diabetes2.csv
Uploaded ./data/diabetes2.csv, 1 files out of an estimated total of 2
Uploaded ./data/diabetes.csv, 2 files out of an estimated total of 2
Uploaded 2 files

Out[5]: $AZUREML_DATAREFERENCE_b0c114d9481b4042b7d6639ffb65a9b5
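
The upload has a counterpart: if you ever need the files back on local disk, blob datastores expose a download method. A sketch, assuming a hypothetical local target folder:

# Pull the uploaded folder back to local storage
default_ds.download(target_path='./downloaded-data', # hypothetical local destination
                    prefix='diabetes-data', # restrict to the folder uploaded above
                    show_progress=True)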

Train a Model from a Datastore


When you uploaded the files in the code cell above, note that the code returned a data reference.
A data reference provides a way to pass the path to a folder in a datastore to a script, regardless of
where the script is being run, so that the script can access data in the datastore location.

The following code gets a reference to the diabetes-data folder where you uploaded the diabetes
CSV files, and specifically configures the data reference for download - in other words, it can be
used to download the contents of the folder to the compute context where the data reference is
being used. Downloading data works well for small volumes of data that will be processed on local
compute. When working with remote compute, you can also configure a data reference to mount
the datastore location and read data directly from the data source.

More Information: For more details about using datastores, see the Azure ML
documentation (https://docs.microsoft.com/azure/machine-learning/how-to-access-
data).

In [6]: data_ref = default_ds.path('diabetes-data').as_download(path_on_compute='diabetes_data')
print(data_ref)

$AZUREML_DATAREFERENCE_7e8f9a0a713747c9a7fc2063f505826c
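
For comparison, a mount-based reference to the same folder would look like the sketch below. Mounting generally requires a compute context that supports it (such as Docker-enabled remote compute), so it isn't used for this local run:

# Mount the datastore folder instead of copying it to the compute
data_ref_mount = default_ds.path('diabetes-data').as_mount()
print(data_ref_mount)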

To use the data reference in a training script, you must define a parameter for it. Run the following
two code cells to create:

1. A folder named diabetes_training_from_datastore
2. A script that trains a classification model by using the training data in all of the CSV files in the folder referenced by the data reference parameter passed to it.


In [7]: import os

# Create a folder for the experiment files
experiment_folder = 'diabetes_training_from_datastore'
os.makedirs(experiment_folder, exist_ok=True)
print(experiment_folder, 'folder created.')

diabetes_training_from_datastore folder created.


In [8]: %%writefile $experiment_folder/diabetes_training.py
# Import libraries
import os
import argparse
from azureml.core import Run
import pandas as pd
import numpy as np
import joblib
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

# Get parameters
parser = argparse.ArgumentParser()
parser.add_argument('--regularization', type=float, dest='reg_rate', default=0.01, help='regularization rate')
parser.add_argument('--data-folder', type=str, dest='data_folder', help='data folder reference')
args = parser.parse_args()
reg = args.reg_rate

# Get the experiment run context
run = Run.get_context()

# load the diabetes data from the data reference
data_folder = args.data_folder
print("Loading data from", data_folder)
# Load all files and concatenate their contents as a single dataframe
all_files = os.listdir(data_folder)
diabetes = pd.concat((pd.read_csv(os.path.join(data_folder, csv_file)) for csv_file in all_files))

# Separate features and labels
X, y = diabetes[['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']].values, diabetes['Diabetic'].values

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Train a logistic regression model
print('Training a logistic regression model with regularization rate of', reg)
run.log('Regularization Rate', np.float(reg))
model = LogisticRegression(C=1/reg, solver="liblinear").fit(X_train, y_train)

# calculate accuracy
y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)
print('Accuracy:', acc)
run.log('Accuracy', np.float(acc))

# calculate AUC
y_scores = model.predict_proba(X_test)
auc = roc_auc_score(y_test, y_scores[:,1])
print('AUC: ' + str(auc))
run.log('AUC', np.float(auc))

os.makedirs('outputs', exist_ok=True)
# note file saved in the outputs folder is automatically uploaded into experiment record
joblib.dump(value=model, filename='outputs/diabetes_model.pkl')

run.complete()

Writing diabetes_training_from_datastore/diabetes_training.py

The script will load the training data from the data reference passed to it as a parameter, so now
you just need to set up the script parameters to pass the data reference when you run the
experiment.


In [9]: from azureml.train.estimator import Estimator
from azureml.core import Experiment, Environment
from azureml.widgets import RunDetails

# Create a Python environment
env = Environment("env")
env.python.user_managed_dependencies = True
env.docker.enabled = False

# Set up the parameters
script_params = {
    '--regularization': 0.1, # regularization rate
    '--data-folder': data_ref # data reference to download files from datastore
}

# Create an estimator
estimator = Estimator(source_directory=experiment_folder,
                      entry_script='diabetes_training.py',
                      script_params=script_params,
                      compute_target='local',
                      environment_definition=env
                      )

# Create an experiment
experiment_name = 'diabetes-training'
experiment = Experiment(workspace=ws, name=experiment_name)

# Run the experiment
run = experiment.submit(config=estimator)
# Show the run details while running
RunDetails(run).show()
run.wait_for_completion()

Run Properties

Status: Completed
Start Time: 6/23/2020 2:39:01 PM
Duration: 0:00:31
Run Id: diabetes-training_1592948340_e4c0a246
Arguments: N/A
Accuracy: 0.7893333333333333
AUC: 0.8568655044545174
Regularization Rate: 0.1
Output Logs: logs/azureml/105135_azureml.log

...|INFO|Current working dir: /tmp/azureml_runs/diabetes-training_1592948340_e4c0a246
2020-06-23 21:39:28,612|azureml.history._tracking.PythonWorkingDirectory.workingdir|DEBUG|Reverting working dir from /tmp/azureml_runs/diabetes-training_1592948340_e4c0a246 to /tmp/azureml_runs/diabetes-training_1592948340_e4c0a246
2020-06-23 21:39:28,612|azureml.history._tracking.PythonWorkingDirectory|INFO|Working dir is already updated /tmp/azureml_runs/diabetes-training_1592948340_e4c0a246

Click here to see the run in Azure Machine Learning studio
(https://ml.azure.com/experiments/diabetes-training/runs/diabetes-training_1592948340_e4c0a246?wsid=/subscriptions/13f3f409-2802-42d9-a29c-f7b5775839d5/resourcegroups/200623/workspaces/workspace200623)

Out[9]: {'runId': 'diabetes-training_1592948340_e4c0a246',
'target': 'local',
'status': 'Finalizing',
'startTimeUtc': '2020-06-23T21:39:03.980418Z',
'properties': {'_azureml.ComputeTargetType': 'local',
'ContentSnapshotId': '9174c2d0-35bb-4ddf-99ce-f00be203eb26'},
'inputDatasets': [],
'runDefinition': {'script': 'diabetes_training.py',
'useAbsolutePath': False,
'arguments': ['--regularization',
'0.1',
'--data-folder',
'$AZUREML_DATAREFERENCE_7e8f9a0a713747c9a7fc2063f505826c'],
'sourceDirectoryDataStore': None,
'framework': 'Python',
'communicator': 'None',
'target': 'local',
'dataReferences': {'7e8f9a0a713747c9a7fc2063f505826c': {'dataStoreName': 'aml_data',
'mode': 'Download',
'pathOnDataStore': 'diabetes-data',
'pathOnCompute': 'diabetes_data',
'overwrite': False}},
'data': {},
'outputData': {},
'jobName': None,
'maxRunDurationSeconds': None,
'nodeCount': 1,
'environment': {'name': 'env',
'version': 'Autosave_2020-06-23T21:39:01Z_8a323df0',
'python': {'interpreterPath': 'python',
'userManagedDependencies': True,
'condaDependencies': {'channels': ['anaconda', 'conda-forge'],
'dependencies': ['python=3.6.2', {'pip': ['azureml-defaults']}],
'name': 'project_environment'},
'baseCondaEnvironment': None},
'environmentVariables': {'EXAMPLE_ENV_VAR': 'EXAMPLE_VALUE'},

'docker': {'baseImage': 'mcr.microsoft.com/azureml/intelmpi2018.3-ubuntu16.04:20200423.v1',
'platform': {'os': 'Linux', 'architecture': 'amd64'},
'baseDockerfile': None,
'baseImageRegistry': {'address': None, 'username': None, 'password': None},
'enabled': False,
'arguments': []},
'spark': {'repositories': [], 'packages': [], 'precachePackages': True},
'inferencingStackVersion': None},
'history': {'outputCollection': True,
'directoriesToWatch': ['logs'],
'snapshotProject': True},
'spark': {'configuration': {'spark.app.name': 'Azure ML Experiment',
'spark.yarn.maxAppAttempts': '1'}},
'parallelTask': {'maxRetriesPerWorker': 0,
'workerCountPerNode': 1,
'terminalExitCodes': None,
'configuration': {}},
'amlCompute': {'name': None,
'vmSize': None,
'retainCluster': False,
'clusterMaxNodeCount': 1},
'tensorflow': {'workerCount': 1, 'parameterServerCount': 1},
'mpi': {'processCountPerNode': 1},
'hdi': {'yarnDeployMode': 'Cluster'},
'containerInstance': {'region': None, 'cpuCores': 2, 'memoryGb': 3.5},
'exposedPorts': None,
'docker': {'useDocker': False,
'sharedVolumes': True,
'shmSize': '2g',
'arguments': []},
'cmk8sCompute': {'configuration': {}},
'itpCompute': {'configuration': {}},
'cmAksCompute': {'configuration': {}}},
'logFiles': {'azureml-logs/60_control_log.txt': 'https://workspace200620667670
440.blob.core.windows.net/azureml/ExperimentRun/dcid.diabetes-training_15929483
40_e4c0a246/azureml-logs/60_control_log.txt?sv=2019-02-02&sr=b&sig=AwatpEiJnS1r
KNBGuUYi1Un3M3HhUEMqTiM%2B2VNl3z8%3D&st=2020-06-23T21%3A29%3A28Z&se=2020-06-24T
05%3A39%3A28Z&sp=r',
'azureml-logs/70_driver_log.txt': 'https://workspace200620667670440.blob.cor
e.windows.net/azureml/ExperimentRun/dcid.diabetes-training_1592948340_e4c0a246/
azureml-logs/70_driver_log.txt?sv=2019-02-02&sr=b&sig=vqzsBXXgWb6feZ%2BdlUNILu0
jkaPEDvElLU5aUpS4xqA%3D&st=2020-06-23T21%3A29%3A28Z&se=2020-06-24T05%3A39%3A28Z
&sp=r',
'logs/azureml/105135_azureml.log': 'https://workspace200620667670440.blob.cor
e.windows.net/azureml/ExperimentRun/dcid.diabetes-training_1592948340_e4c0a246/
logs/azureml/105135_azureml.log?sv=2019-02-02&sr=b&sig=W3B7aXoUhgTTbIuJf9Qg4mqo
k8o5SfYJTqjO25z%2Fpu0%3D&st=2020-06-23T21%3A29%3A28Z&se=2020-06-24T05%3A39%3A28
Z&sp=r'}}

The first time the experiment is run, it may take some time to set up the Python environment;
subsequent runs will be quicker.
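
The environment in this lab is user-managed, so there is little to build; the slow first run is most noticeable with system-managed environments, where Azure ML resolves and builds a conda environment from a specification like the sketch below (the package choices here are illustrative):

from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies

# A system-managed environment: Azure ML builds this conda environment on first use
env = Environment('diabetes-env')
env.python.user_managed_dependencies = False
env.python.conda_dependencies = CondaDependencies.create(conda_packages=['scikit-learn', 'pandas'],
                                                         pip_packages=['azureml-defaults'])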


When the experiment has completed, in the widget, view the output log to verify that the data files
were downloaded.

As with all experiments, you can view the details of the experiment run in Azure ML studio
(https://ml.azure.com), and you can write code to retrieve the metrics and files generated:

In [10]: # Get logged metrics
metrics = run.get_metrics()
for key in metrics.keys():
    print(key, metrics.get(key))
print('\n')
for file in run.get_file_names():
    print(file)

Regularization Rate 0.1
Accuracy 0.7893333333333333
AUC 0.8568655044545174

azureml-logs/60_control_log.txt
azureml-logs/70_driver_log.txt
logs/azureml/105135_azureml.log
outputs/diabetes_model.pkl

Once again, you can register the model that was trained by the experiment.


In [11]: from azureml.core import Model

# Register the model
run.register_model(model_path='outputs/diabetes_model.pkl', model_name='diabetes_model',
                   tags={'Training context':'Using Datastore'},
                   properties={'AUC': run.get_metrics()['AUC'], 'Accuracy': run.get_metrics()['Accuracy']})

# List the registered models
print("Registered Models:")
for model in Model.list(ws):
    print(model.name, 'version:', model.version)
    for tag_name in model.tags:
        tag = model.tags[tag_name]
        print('\t', tag_name, ':', tag)
    for prop_name in model.properties:
        prop = model.properties[prop_name]
        print('\t', prop_name, ':', prop)
    print('\n')

Registered Models:
diabetes_model version: 3
Training context : Using Datastore
AUC : 0.8568655044545174
Accuracy : 0.7893333333333333

diabetes_model version: 2
Training context : Parameterized SKLearn Estimator
AUC : 0.8483904671874223
Accuracy : 0.7736666666666666

diabetes_model version: 1
Training context : Estimator
AUC : 0.8483377282451863
Accuracy : 0.774

amlstudio-predict-diabetes version: 1
CreatedByAMLStudio : true

In this exercise, you've explored some options for working with data in the form of datastores.

Azure Machine Learning offers a further level of abstraction for data in the form of datasets, which
you'll explore next.
