Reference Guide: Sherlock Innovation Accelerator Platform - Data Science
At a high level, the data platform framework has four broad stages: ingest, store, process and analyze, and visualize.
(Diagram: Collect → Store → Process & Analyze → Visualize)
Since S3 pricing is usage-based, you will only pay for what you use.
AWS S3 Configuration and Security
1. S3 bucket created solely for the PI's project
2. Access will require SSL/TLS (encryption in transit)
3. Access to the S3 bucket restricted to the host IP address via an S3 bucket policy
4. Access restricted to the AWS IAM account created to access this S3 bucket for the purpose of the PI's project
5. The S3 bucket will use AWS SSE-KMS for data encryption at rest (AES-256). A project-specific KMS key will be created and used specifically for the PI's project and data encryption.
a. S3 policy set to require all uploads to use SSE-KMS (see the policy sketch after this list)
6. Once S3 policies are set, all changes to the S3 bucket policies will be monitored using CloudTrail
7. Security will configure CloudWatch alarms and filters on the logs created by CloudTrail
8. Security will set up an SNS notification topic (with themselves as subscribers) for the CloudWatch alarms
9. Amazon Macie is an option for additional security but would be very expensive
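The policy sketch referenced in item 5a, using boto3 (the AWS SDK for Python): it denies requests that are not made over SSL/TLS and uploads that do not request SSE-KMS. The bucket name is a placeholder, and the actual policy applied by Security may differ.

import json
import boto3

s3 = boto3.client("s3")
bucket = "pi-project-bucket"  # placeholder name for the PI's project bucket

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # deny any request that is not made over SSL/TLS (encryption in transit)
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        },
        {   # deny uploads that do not request SSE-KMS encryption at rest
            "Sid": "DenyUnencryptedUploads",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": f"arn:aws:s3:::{bucket}/*",
            "Condition": {"StringNotEquals": {"s3:x-amz-server-side-encryption": "aws:kms"}},
        },
    ],
}

s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))

The host-IP restriction in item 3 could be expressed in the same policy with an aws:SourceIp condition.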
SAMPLE COST OF S3 STORAGE FOR 15 TB PER MONTH.
If S3 storage is used for less than a month, the cost will be calculated by GB-hour, at $0.023 per GB-month.
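As a rough check at that rate, 15 TB is roughly 15,360 GB, and 15,360 GB × $0.023 ≈ $353 per month for storage alone, before request and data-transfer charges.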
Below is a diagram of a Default Baseline installation to access the environment and the S3 Files.
STAGE 3B: Processing and Analysis Options
Users are more open to changing to newer or better technologies if the switching cost is minimal. The intent of this type of project is to make available the programs they are used to in their current environment (i.e. Python-based or R)3 and to offer them in our secure environment as a host- or cluster-based processing solution.
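For the Python route mentioned in footnote 3, reading project data from S3 takes only a few lines with boto3, the package behind the AWS Python SDK; the bucket and key names here are placeholders:

import boto3

s3 = boto3.client("s3")

# download an object from the project bucket to local storage
s3.download_file("pi-project-bucket", "files/file_a.txt.gz", "/tmp/file_a.txt.gz")

# or read it straight into memory
obj = s3.get_object(Bucket="pi-project-bucket", Key="files/file_a.txt.gz")
data = obj["Body"].read()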
Large hosts are needed to run data analytics for a project. However, these hosts are very expensive, and it would be cost prohibitive to leave one running while a PI sets up software on the host. So we will set up an inexpensive EC2 instance (t3.medium) for the PI to install and configure their software packages. This host is temporary; once the PI has completed installation, we will create a snapshot of this EC2 instance and create a custom AMI (which includes the software). This custom AMI will be used to create Spot Instances of an r5.24xlarge. We will discuss this piece further below.
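The snapshot-to-AMI step itself is a single API call; a boto3 sketch with a placeholder instance ID, not the exact tooling the team will use:

import boto3

ec2 = boto3.client("ec2")

# capture the configured t3.medium (software included) as a reusable custom AMI
image = ec2.create_image(InstanceId="i-0123456789abcdef0", Name="proj-custom-ami")
print(image["ImageId"])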
3
To access S3 data from within R, a user would need to use the aws.s3 client package (https://CRAN.R-project.org/package=aws.s3). A Python user can access S3 by using the AWS Python SDK (https://aws.amazon.com/sdk-for-python/).
This temporary host will have access to any Git repos the PI needs through our web proxy. This web proxy only allows whitelisted sites, so the PI will need to provide the Git URLs for whitelisting. We can also whitelist any Python package repositories or other sites that are requested.
The storage and compute cost for this is about $67.00 a month. This can be prorated if the EC2 instance and EBS volume are not needed for a full month. There will also be an EBS cost for storage of the AMI template at $0.10 per GB.
At this stage the Jumpbox (<PROJ>-MGMT-RDS) and the S3 bucket have been created, and the temporary EC2 instance (t3.medium) has been turned into a custom AMI and shut down. We are now ready to launch the large compute host. This EC2 instance is expensive, and the intent is to have this host running only during a job and then have it shut down. The cost of an r5.24xlarge is as follows.
It would be cheaper to run these processes using a Spot Instance than an On-Demand instance. If you were to run the host for ~300 hours during a month, the following would be the cost of Spot vs On-Demand:
In order to take advantage of the Spot Instance option, we will need to create an instance (and bid for the Spot Instance4) each time a process needs to be run, and the instance will need to be deleted upon completion (~12-36 hours later, when the run completes).
The PI will be able to create an EC2 instance by executing a batch script on the Jumpbox (<PROJ>-MGMT-RDS), which will execute an AWS Lambda function that runs a CloudFormation template configured to launch an r5 instance using the AMI we previously created, create a temporary EBS storage volume, and attach the temporary EBS storage to the r5 host (<PROJ>-APP-01).
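A minimal sketch of what that Lambda function might look like, assuming a CloudFormation template already stored in S3; the stack name, template URL, and parameter names are placeholders chosen for illustration:

import boto3

cfn = boto3.client("cloudformation")

def launch_compute_stack(event, context):
    # create the stack that launches the r5 Spot instance from the custom AMI
    # and attaches a temporary EBS volume
    response = cfn.create_stack(
        StackName="proj-app-01-compute",
        TemplateURL="https://s3.amazonaws.com/proj-templates/r5-spot-compute.yaml",
        Parameters=[
            {"ParameterKey": "AmiId", "ParameterValue": event["ami_id"]},
            {"ParameterKey": "InstanceType", "ParameterValue": "r5.24xlarge"},
        ],
    )
    return response["StackId"]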
If the PI would like to work from the CLI, the PI can SSH into the newly created EC2 instance (<PROJ>-APP-01) from <PROJ>-MGMT-RDS using the PI's AD account and run processes. If the PI would like to run Jupyter or Zeppelin notebooks or RStudio, the PI can access these on the r5 host from a web browser on <PROJ>-MGMT-RDS.
When the processing is done, the PI will run the shutdown batch script (from <PROJ>-MGMT-RDS), which launches another Lambda function that deletes the CloudFormation stack. When this stack is deleted, the EC2 instance and the temporary EBS volume are deleted with it.
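The teardown Lambda can be correspondingly small (same placeholder stack name as in the sketch above); deleting the stack is what removes the EC2 instance and the temporary EBS volume:

import boto3

cfn = boto3.client("cloudformation")

def delete_compute_stack(event, context):
    # deleting the stack terminates the r5 instance and removes the temp EBS volume
    cfn.delete_stack(StackName="proj-app-01-compute")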
IPython Cluster
While the above option allows a PI to continue to access individual files in the way they may be accustomed to, it suffers from the limitations of vertical scalability: a host can only get so large. If a PI would like to continue to expand resources while still using their existing code base, we can offer Jupyter clusters on demand.
A PI would still use S3 as before, but now we would offer a Jupyter Notebook server as a Cluster-Master, accessible from the project RDS Jumpbox via a web browser. This server will always be on. It is a smaller instance, as it will not be doing the computation but rather storing the Jupyter notebooks and running the cluster scheduling. Now, rather than running a CloudFormation template to spin up a single large node, a PI could run a CloudFormation template to spin up a cluster of Spot Instances as slave nodes to the Cluster-Master (running IPython Parallel). We would set up Chef to configure the slave nodes. These hosts do not need to be r5 instances; they can be smaller and cheaper and still provide more than 48 cores for the cluster.5
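From the PI's perspective the notebook code changes very little; a minimal ipyparallel sketch (assuming the cluster profile has already been configured on the Cluster-Master, and with a placeholder work function):

import ipyparallel as ipp

# connect to the running IPython Parallel cluster from a notebook on the Cluster-Master
rc = ipp.Client()
view = rc.load_balanced_view()

def process_file(key):
    # placeholder for the PI's existing per-file processing logic
    return len(key)

keys = ["files/file_a.txt.gz", "files/file_b.txt.gz", "files/file_c.txt.gz"]
results = view.map_sync(process_file, keys)  # work is spread across the slave nodes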
4
We will bid for the Spot Instance using the On-Demand price as our maximum bid. That way, when the re-bid for the instance occurs each hour, we will not lose our Spot Instance unless the Spot price rises above the On-Demand price.
5
While this is an example of the setup in Google Cloud, we would be able to use CloudFormation with Chef to
create this cluster in AWS (How to setup an IPython Parallel cluster on Google Compute Engine).
IPython Cluster Alternatives
We can also offer StarCluster6 as an alternative to IPython Parallel.
If a PI is open to using Hadoop, we can automate an EMR installation using CloudFormation with PySpark and Zeppelin Notebook7.
6
See further: http://star.mit.edu/cluster/
7
See further: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-launch.html
8
See further: https://aws.amazon.com/quicksight/
9
https://aws.amazon.com/emr/features/presto/
Athena can be accessed from the console, the CLI, the AWS SDKs, and database connectors (JDBC/ODBC)10. No servers need to be set up by the user, and the user pays only for the TB scanned.11
Athena Setup
Some initial setup is required before a user can use Athena. Because Athena runs a Presto query engine over a Hive-compatible metastore, Hive external table schemas must be created to allow Athena to read any data on S3 that a user would like to query. Athena needs these external table schemas because it applies schema-on-read to the data stored in S3; the data and the schemas are kept separate.
So, for example, suppose you had the following files in an S3 bucket named “test-bucket”:
$ aws s3 ls s3://test-bucket/files/
2016-11-23 02:21:15 0
2016-11-23 00:21:21 69043364 file_a.txt.gz
2016-11-23 00:21:29 11243304 file_b.txt.gz
2016-11-23 00:21:32 10253602 file_c.txt.gz
2016-11-23 00:21:39 12256401 file_d.txt.gz
You could create a Hive external table on top of those files. You specify the structure of the files by giving column names and types, along with a location; all files found in that location folder then become part of the table.
CREATE EXTERNAL TABLE entries (name STRING, comment STRING, count INT)
LOCATION 's3://test-bucket/files/';
Athena will store these table schemas in Glue's Data Catalog.12 An AWS account can have only one Data Catalog, and this Data Catalog services all Athena usage for all potential projects. A Data Catalog can have many Databases, and a Database is a logical collection of Tables. IAM access policies can be applied to the Data Catalog, Databases, and Tables, but cannot be applied to rows, columns, or partitions within a table. Each PI will be assigned a Database under which external tables can be created, and no project will have access to another project's Database metadata.
To run a query using Athena, a user must not only have permissions on the Database/Table metadata but must also have S3 permissions on the underlying data defined by the location in the table metadata. In other words, a user needs access to both the Data Catalog -> Database -> Table hierarchy and the S3 data in order to run a query.
Any new files added to the folder defined in an existing Hive external table (as long as the files have the same schema) will be included in all future queries.
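Once a table exists and the permissions line up, a query can be submitted from the Jumpbox in a few lines of Python; a boto3 sketch, with a placeholder database name and results location:

import time
import boto3

athena = boto3.client("athena")

# submit the query; Athena writes results to the given S3 output location
query = athena.start_query_execution(
    QueryString='SELECT name, "count" FROM entries LIMIT 10',
    QueryExecutionContext={"Database": "pi_project_db"},
    ResultConfiguration={"OutputLocation": "s3://test-bucket/athena-results/"},
)
execution_id = query["QueryExecutionId"]

# wait for the query to finish before fetching results
while True:
    state = athena.get_query_execution(QueryExecutionId=execution_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=execution_id)["ResultSet"]["Rows"]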
10
These DB connectors can be installed on the Jumpbox and serve Tableau, Cognos, or any other SQL-based application.
11
Athena costs $5.00 per TB scanned; a 15 TB query would cost $75.00. Therefore a PI would want to set up data in S3 in such a way that only pertinent data is scanned per query and all data files are compressed. See further: https://aws.amazon.com/athena/pricing/.
12
AWS Glue limits can be found at: https://docs.aws.amazon.com/glue/latest/dg/troubleshooting-service-limits.html
Optimizing Athena Costs
Athena charges by the TB scanned. There are several ways to structure files in S3 to reduce the amount of data scanned and increase query speed.
Compress Files
Athena supports the following compression formats: SNAPPY (the default compression format for files in the Parquet storage format), ZLIB (the default compression format for files in the ORC storage format), LZO, and GZIP. For data in CSV, TSV, and JSON, Athena determines the compression type from the file extension; if it is not present, the data is not decompressed. If your data is compressed, make sure the file name includes the compression extension, such as gz.13 If a PI were to get a 3:1 compression gain, that would reduce scan costs by two-thirds.
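To put that in concrete terms using the pricing in footnote 11: a full scan of 15 TB of uncompressed data costs about $75, while the same data compressed 3:1 (roughly 5 TB) would cost about $25 per full scan.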
Use Partitions in Hive Tables14
Adding partitions to table schemas allows an Athena query to scan only those files that pertain to the WHERE clause of the query. So, for instance, if your table is partitioned by day, you could query between two dates and only the data between those two dates would be scanned. This avoids scanning data that was never needed. Partitions work best when they reflect common range filters (e.g., locations or timestamp ranges). If a PI is going to use Athena on large data in an interactive way, restructuring unorganized data into folders and creating tables with partitions would lead to huge cost savings.
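A sketch of the idea with hypothetical table and partition names (Appendix 1 covers partitioning in more detail): the DDL below declares a dt partition, and the query's WHERE clause limits the scan to two days of data. Both statements could be submitted through start_query_execution exactly as in the earlier sketch.

# hypothetical partitioned table; folders under the location are laid out as dt=YYYY-MM-DD/
ddl = """
CREATE EXTERNAL TABLE events (name STRING, comment STRING)
PARTITIONED BY (dt STRING)
LOCATION 's3://test-bucket/events/'
"""

# partitions must be registered before they are visible, e.g.
# ALTER TABLE events ADD PARTITION (dt='2016-11-01') LOCATION 's3://test-bucket/events/dt=2016-11-01/'

# only the files under the two matching dt= partitions are scanned
pruned_query = """
SELECT name, comment
FROM events
WHERE dt BETWEEN '2016-11-01' AND '2016-11-02'
"""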
Use Column-based Storage Formats
Most queries do not require all columns. However, if you are using a row-based storage format such as JSON or CSV, Athena initially reads all the columns of each row and then discards the columns not included in the SELECT statement. This increases the cost of Athena scans. Take, for example, 3 TB of row-based data: if you compress it and also convert it to a columnar format like Apache Parquet, achieving 3:1 compression, you still end up with 1 TB of data on Amazon S3. But, in this case, because Parquet is columnar, Amazon Athena can read only the columns that are relevant to the query being run. If the query in question references a single column of a table with three roughly equal-sized columns, Athena reads only that column and avoids reading two-thirds of the file. Since Athena only reads one-third of the file, it scans just 0.33 TB of data from S3.15
If a PI had a Hive external table defined in his Database in CSV or JSON format and he wanted to create a new table from this data in the Parquet or ORC storage format, he could perform a CTAS (Create Table As Select) query. The following example creates a CTAS query that stores the results as a text file:
13
https://docs.aws.amazon.com/athena/latest/ug/compression-formats.html
14
See Appendix 1 for more information about partition in Hive External Tables
15
https://aws.amazon.com/athena/pricing/
-- illustrative table name and column list
CREATE TABLE ctas_text_file
WITH (format = 'TEXTFILE')
AS SELECT *
FROM table1;
In the following example, results are stored in Parquet, and the default results location is used:

CREATE TABLE ctas_parquet
WITH (format = 'PARQUET')
AS SELECT *
FROM table1;
Athena uses a default results location. Previously, this was the same location for all Athena queries in a single AWS account. It looks like, with Athena Workgroups, we will be able to set the results location and encryption settings per project.16
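If Workgroups are adopted, a per-project results location could be set up along these lines; a boto3 sketch with placeholder names rather than a settled design:

import boto3

athena = boto3.client("athena")

# create a per-project workgroup with its own enforced results location
athena.create_work_group(
    Name="pi-project-wg",
    Configuration={
        "ResultConfiguration": {"OutputLocation": "s3://test-bucket/athena-results/"},
        "EnforceWorkGroupConfiguration": True,
    },
)

An EncryptionConfiguration can also be added under ResultConfiguration to enforce per-project encryption of result sets.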
16
Athena Workgroups was released in Feb ’19, and this will allow Sherlock to set a default results location per workgroup (in our case, per project) and to control access to result sets. See further:
https://docs.aws.amazon.com/athena/latest/ug/workgroups-settings.html
17
See: https://aws.amazon.com/sagemaker/
18
https://aws.amazon.com/blogs/machine-learning/run-sql-queries-from-your-sagemaker-notebooks-using-amazon-athena/