04A - Working With Datastores - Jupyter Notebook PDF
Important: The code in this notebook assumes that you have completed the first
two tasks in Lab 4A. If you have not done so, go and do it now!
Note: If the authenticated session with your Azure subscription has expired since
you completed the previous exercise, you'll be prompted to reauthenticate.
Run the following code to retrieve the default datastore, and then list all of the datastores indicating
which is the default.
In [4]: ws.set_default_datastore('aml_data')
default_ds = ws.get_default_datastore()
print(default_ds.name)
aml_data
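The cell above prints only the name of the default datastore. A minimal sketch of the listing step follows, assuming the Azure ML SDK v1, where ws.datastores is a dictionary keyed by datastore name. The helper describe_datastores and the sample names are illustrative and are not part of the lab.

```python
def describe_datastores(datastore_names, default_name):
    """Return one line per datastore, flagging the workspace default."""
    return [f"{name} - Default = {name == default_name}"
            for name in datastore_names]

# In the notebook this would be called with live workspace objects, e.g.
#   describe_datastores(ws.datastores, ws.get_default_datastore().name)
# The names below are sample values so the sketch runs without a workspace.
sample_names = ["aml_data", "workspaceblobstore", "workspacefilestore"]
lines = describe_datastores(sample_names, "aml_data")
for line in lines:
    print(line)
```

Comparing names rather than objects keeps the check independent of how the SDK implements equality on datastore objects.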
Out[5]: $AZUREML_DATAREFERENCE_b0c114d9481b4042b7d6639ffb65a9b5
The following code gets a reference to the diabetes-data folder where you uploaded the diabetes
CSV files, and specifically configures the data reference for download - in other words, it can be
used to download the contents of the folder to the compute context where the data reference is
being used. Downloading data works well for small volumes of data that will be processed on local
compute. When working with remote compute, you can also configure a data reference to mount
the datastore location and read data directly from the data source.
More Information: For more details about using datastores, see the Azure ML
documentation (https://docs.microsoft.com/azure/machine-learning/how-to-access-data).
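The choice between downloading and mounting described above can be sketched as follows. This assumes the Azure ML SDK v1, where Datastore.path() returns a DataReference that offers as_download() and as_mount(). The helper name build_data_reference and its mode argument are hypothetical, introduced only for illustration.

```python
def build_data_reference(datastore, folder, mode="download", path_on_compute=None):
    """Return a reference to `folder`, configured for download or mount.

    `datastore` is expected to behave like an azureml.core Datastore
    (SDK v1), whose path() method returns a DataReference.
    """
    if mode not in ("download", "mount"):
        raise ValueError(f"unsupported mode: {mode!r}")
    reference = datastore.path(folder)
    if mode == "download":
        # Copy the folder contents to the compute context.
        return reference.as_download(path_on_compute=path_on_compute)
    # Mount the datastore location so files are read from the source.
    return reference.as_mount()

# Possible usage in the notebook (requires a live datastore object):
#   data_ref = build_data_reference(default_ds, 'diabetes-data')
#   print(data_ref)
```

Download suits the small CSV files in this lab. Mounting is generally the better fit for large volumes on remote compute, because nothing is copied up front.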
$AZUREML_DATAREFERENCE_7e8f9a0a713747c9a7fc2063f505826c
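To use the data reference in a training script, you must define a parameter for it. Run the following
two code cells to create a folder for the experiment files and a training script that accepts the
data folder as a parameter: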
In [7]: import os
import argparse
import joblib
import numpy as np
from sklearn.metrics import roc_auc_score
# Get parameters
parser = argparse.ArgumentParser()
parser.add_argument('--regularization', type=float, dest='reg_rate', default=0.01
parser.add_argument('--data-folder', type=str, dest='data_folder', help='data fol
args = parser.parse_args()
reg = args.reg_rate
# calculate accuracy
y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)
print('Accuracy:', acc)
run.log('Accuracy', float(acc))
# calculate AUC
y_scores = model.predict_proba(X_test)
auc = roc_auc_score(y_test,y_scores[:,1])
print('AUC: ' + str(auc))
run.log('AUC', float(auc))
os.makedirs('outputs', exist_ok=True)
# note file saved in the outputs folder is automatically uploaded into experiment
joblib.dump(value=model, filename='outputs/diabetes_model.pkl')
run.complete()
Writing diabetes_training_from_datastore/diabetes_training.py
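The listing above does not show how the script reads the files in the data folder. One way to do it is sketched below, using only the standard library so that it runs anywhere. The lab script may well use pandas instead. The function name load_rows, the file names, and the column names are made up for the demonstration. In the script itself, the folder would come from args.data_folder.

```python
import csv
import glob
import os
import tempfile

def load_rows(data_folder):
    """Read every CSV file in data_folder into one list of dict rows."""
    rows = []
    for csv_path in sorted(glob.glob(os.path.join(data_folder, "*.csv"))):
        with open(csv_path, newline="") as handle:
            rows.extend(csv.DictReader(handle))
    return rows

# Demonstration with two small throwaway files (made-up values).
with tempfile.TemporaryDirectory() as data_folder:
    for name, value in (("part1.csv", "1"), ("part2.csv", "0")):
        with open(os.path.join(data_folder, name), "w", newline="") as handle:
            handle.write("PatientID,Diabetic\n")
            handle.write(f"100,{value}\n")
    rows = load_rows(data_folder)

print(len(rows), "rows loaded")
```

Because the script reads whatever CSV files it finds in the folder, the same code works whether the data reference was downloaded or mounted.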
The script will load the training data from the data reference passed to it as a parameter, so now
you just need to set up the script parameters to pass the data reference when you run the
experiment.
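The estimator passes script_params to the entry script as command-line arguments. The sketch below imitates that hand-off locally so you can see how a parameter dictionary ends up in args. The values are illustrative: in the notebook, '--data-folder' would be the data reference object rather than a literal path, and the help text is left out here.

```python
import argparse

# Parser mirroring the two arguments defined in the training script.
parser = argparse.ArgumentParser()
parser.add_argument("--regularization", type=float, dest="reg_rate", default=0.01)
parser.add_argument("--data-folder", type=str, dest="data_folder")

# Hypothetical values standing in for the notebook's script parameters.
script_params = {"--regularization": 0.1, "--data-folder": "/tmp/diabetes_data"}

# Flatten the dictionary into the argument list the script would receive.
argv = [str(token) for pair in script_params.items() for token in pair]
args = parser.parse_args(argv)
print(args.reg_rate, args.data_folder)
```

The dictionary keys must match the option strings declared with add_argument, since argparse rejects any option it does not recognize.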
# Create an estimator
estimator = Estimator(source_directory=experiment_folder,
                      entry_script='diabetes_training.py',
                      script_params=script_params,
                      compute_target='local',
                      environment_definition=env)
# Create an experiment
experiment_name = 'diabetes-training'
experiment = Experiment(workspace=ws, name=experiment_name)
Run Properties
Status Completed
Duration 0:00:31
Run Id diabetes-training_1592948340_e4c0a246
Arguments N/A
Accuracy 0.7893333333333333
AUC 0.8568655044545174
The first time the experiment is run, it may take some time to set up the Python environment -
subsequent runs will be quicker.
When the experiment has completed, in the widget, view the output log to verify that the data files
were downloaded.
As with all experiments, you can view the details of the experiment run in Azure ML studio
(https://ml.azure.com), and you can write code to retrieve the metrics and files generated:
azureml-logs/60_control_log.txt
azureml-logs/70_driver_log.txt
logs/azureml/105135_azureml.log
outputs/diabetes_model.pkl
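A listing like the one above could be produced along these lines, assuming an Azure ML SDK v1 Run object, which provides get_metrics() and get_file_names(). The helper name summarize_run is hypothetical.

```python
def summarize_run(run):
    """Collect the logged metrics, then the generated file names, as text lines."""
    lines = []
    for name, value in run.get_metrics().items():
        lines.append(f"{name} {value}")
    lines.extend(run.get_file_names())
    return lines

# Possible usage in the notebook, once the run has completed:
#   print('\n'.join(summarize_run(run)))
```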
Once again, you can register the model that was trained by the experiment.
Registered Models:
diabetes_model version: 3
Training context : Using Datastore
AUC : 0.8568655044545174
Accuracy : 0.7893333333333333
diabetes_model version: 2
Training context : Parameterized SKLearn Estimator
AUC : 0.8483904671874223
Accuracy : 0.7736666666666666
diabetes_model version: 1
Training context : Estimator
AUC : 0.8483377282451863
Accuracy : 0.774
amlstudio-predict-diabetes version: 1
CreatedByAMLStudio : true
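The registration and listing steps can be sketched as below. This assumes the Azure ML SDK v1, where Run.register_model() accepts tags and properties and each registered model exposes name, version, tags, and properties. Treating "Training context" as a tag and the two metrics as properties is an assumption based on the output above. Both helper names are hypothetical.

```python
def register_trained_model(run):
    """Register the saved model file, recording the run's metrics with it."""
    metrics = run.get_metrics()
    return run.register_model(
        model_path="outputs/diabetes_model.pkl",
        model_name="diabetes_model",
        tags={"Training context": "Using Datastore"},
        properties={"AUC": metrics["AUC"], "Accuracy": metrics["Accuracy"]},
    )

def format_models(models):
    """Return text lines describing each model: name, version, tags, properties."""
    lines = []
    for model in models:
        lines.append(f"{model.name} version: {model.version}")
        for key, value in model.tags.items():
            lines.append(f"  {key} : {value}")
        for key, value in model.properties.items():
            lines.append(f"  {key} : {value}")
    return lines

# Possible usage in the notebook (requires a live run and workspace):
#   register_trained_model(run)
#   print('\n'.join(format_models(Model.list(ws))))
```

Storing the metrics with each version makes it easy to compare versions later, as the three diabetes_model entries above show.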
In this exercise, you've explored some options for working with data in the form of datastores.
Azure Machine Learning offers a further level of abstraction for data in the form of datasets, which
you'll explore next.