S3
If you are studying for the AWS Solutions Architect Professional exam, this guide will help you with quick revision before the exam. It can be used as study notes for your preparation.
Storage Classes
- S3 Standard (default):
- The objects are stored in at least 3 AZs
- Provides eleven nines of durability
- Replication uses MD5 checksums together with CRCs to detect object corruption
- When an object is stored in S3 using the API, an HTTP 200 OK response is returned
- Billing:
- GB/month of data stored in S3
- A per-GB charge for data transfer out (transfer in is free)
- Price per 1000 requests
- No specific retrieval fee, no minimum duration, no minimum size
- S3 standard makes data accessible immediately, can be used for static website hosting
- Should be used for data frequently accessed
- S3 Standard-IA:
- Shares most of the characteristics of S3 Standard: objects are replicated in 3 AZs, durability is the same, first byte latency is the same, objects can be made publicly available. Designed availability is slightly lower (99.9% instead of 99.99%)
- Billing:
- It is more cost effective for storing data
- Data transfer fee is the same as S3 standard
- Retrieval fee: for every GB of data there is a retrieval fee, overall cost may increase with frequent data access
- Minimum duration charge: we will be billed for a minimum of 30 days. There is also a minimum billable object size of 128 KB (smaller objects are billed as 128 KB)
- Should be used for long lived data where data access is infrequent
- S3 One Zone-IA:
- Similar to S3 standard, but cheaper. Also cheaper than S3 standard IA
- Data stored using this class is only stored in one AZ
- Billing:
- Similar to S3 standard IA: similar minimum duration fee of 30 days, similar billing for smaller objects and also similar retrieval fee per GB
- Same level of durability (if the AZ does not fail)
- Data is replicated inside one AZ
- Since data is not replicated between AZs, this storage class is not HA. It should be used for non-critical data or for data that can be reproduced easily
- S3 Glacier:
- Same data replication as S3 standard and S3 standard IA
- Same durability characteristics
- Storage cost is about 1/5 of S3 standard
- S3 objects stored in Glacier should be considered cold objects (should not be accessed frequently)
- Objects in Glacier class are just pointers to real objects and they can not be made public
- In order to retrieve them, we have to perform a retrieval process:
- A job that needs to be done to get access to objects
- Retrievals processes are billed
- When objects are retrieved from Glacier, they are temporarily stored in standard IA and they are removed after a while. We can retrieve them permanently as well
- Retrieval process types:
- Expedited: objects are retrieved in 1-5 minutes; this is the most expensive retrieval type
- Standard: data is accessible within 3-5 hours
- Bulk: data is accessible within 5-12 hours at the lowest cost
- Glacier has a 40KB minimum billable size and a 90 days minimum duration for storage
- Glacier should be used for data archival, where data can be retrieved in minutes to hours
- S3 Glacier Deep Archive:
- Approximately 1/4 of the price of standard Glacier
- Deep Archive represents data in a frozen state
- Has a 40KB minimum billable data size and a 180 days minimum duration for data storage
- Objects can not be made publicly available, data access is similar to standard Glacier class
- Restore jobs are longer:
- Standard: up to 12 hours
- Bulk: up to 48 hours
- Should be used for archival which is very rarely accessed
- S3 Intelligent-Tiering:
- It is a storage class containing 4 different storage tiers
- Objects that are accessed frequently are stored in the Frequent Access tier, less frequently accessed objects are stored in the Infrequent Access tier. Objects accessed very infrequently will be stored in either the Archive or the Deep Archive tier
- We don’t have to worry about moving objects between tiers, this is done by the storage class automatically
- Intelligent-Tiering can be configured: archiving data is optional and can be enabled/disabled
- There is no retrieval cost for moving data between the frequent and infrequent tiers; instead we are billed a monitoring and automation fee per 1000 objects
- S3 Intelligent-Tiering is recommended for unknown or uncertain data access usage
- Storage classes comparison:
Characteristic | S3 Standard | S3 Intelligent-Tiering | S3 Standard-IA | S3 One Zone-IA | S3 Glacier | S3 Glacier Deep Archive |
---|---|---|---|---|---|---|
Designed for durability | 99.999999999% (11 9’s) | 99.999999999% (11 9’s) | 99.999999999% (11 9’s) | 99.999999999% (11 9’s) | 99.999999999% (11 9’s) | 99.999999999% (11 9’s) |
Designed for availability | 99.99% | 99.9% | 99.9% | 99.5% | 99.99% | 99.99% |
Availability SLA | 99.9% | 99% | 99% | 99% | 99.9% | 99.9% |
Availability Zones | ≥3 | ≥3 | ≥3 | 1 | ≥3 | ≥3 |
Minimum capacity charge per object | N/A | N/A | 128KB | 128KB | 40KB | 40KB |
Minimum storage duration charge | N/A | 30 days | 30 days | 30 days | 90 days | 180 days |
Retrieval fee | N/A | N/A | per GB retrieved | per GB retrieved | per GB retrieved | per GB retrieved |
First byte latency | milliseconds | milliseconds | milliseconds | milliseconds | select minutes or hours | select hours |
Storage type | Object | Object | Object | Object | Object | Object |
Lifecycle transitions | Yes | Yes | Yes | Yes | Yes | Yes |
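The minimum size and minimum duration charges in the comparison table above are easier to remember with a small worked calculation. The following is a minimal sketch: the `MINIMUMS` dictionary and the `billable_kb_days` function are illustrative names, the values are copied from the table, and prices are deliberately left out.

```python
# Minimum billable object size (KB) and minimum storage duration (days),
# copied from the comparison table. None means "no minimum".
MINIMUMS = {
    "STANDARD": (None, None),
    "INTELLIGENT_TIERING": (None, 30),
    "STANDARD_IA": (128, 30),
    "ONEZONE_IA": (128, 30),
    "GLACIER": (40, 90),
    "DEEP_ARCHIVE": (40, 180),
}


def billable_kb_days(storage_class, size_kb, days_stored):
    """Return the KB-days that would be billed for a single object."""
    min_kb, min_days = MINIMUMS[storage_class]
    billed_kb = max(size_kb, min_kb or 0)        # small objects are rounded up
    billed_days = max(days_stored, min_days or 0)  # early deletes still pay the minimum
    return billed_kb * billed_days
```

For example, a 10 KB object deleted after 5 days is billed as 10 KB for 5 days in S3 Standard, but as 128 KB for 30 days in S3 Standard-IA.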
S3 Lifecycle Configuration
- We can create lifecycle rules on S3 buckets which can move objects between tiers or expire objects automatically
- A lifecycle configuration is a set of rules applied to a bucket or a group of objects in a bucket
- Rules consist of actions:
- Transition actions: move objects from one tier to another after a certain time
- Expiration actions: delete objects or versions of objects
- Lifecycle rules can not move objects based on how often they are accessed (that is what Intelligent-Tiering does). They can only move objects based on the time passed
- Moving objects to cheaper tiers saves costs, and so does expiring objects that are no longer needed
- Transitions between tiers:
- Considerations:
- Smaller objects cost more in Standard-IA, One Zone-IA, etc.
- An object needs to remain for at least 30 days in the standard tier before it can be moved to the infrequent tiers (objects can be uploaded directly into the infrequent tiers)
- A single rule can not move objects from standard to the infrequent tiers and then immediately on to the Glacier tiers. With one rule, objects have to stay for at least 30 days in the infrequent tiers before they can be moved again. In order to overcome this, we can define 2 different rules
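The transition and expiration actions described above can be sketched as a lifecycle configuration. This is a minimal sketch, not a definitive implementation: the function name, the rule ID and the prefix are hypothetical, and the returned dictionary follows the shape that boto3's `put_bucket_lifecycle_configuration` accepts as `LifecycleConfiguration`.

```python
def build_lifecycle_configuration(prefix, ia_after_days, glacier_after_days, expire_after_days):
    """Build one rule: Standard -> Standard-IA -> Glacier -> expire."""
    if ia_after_days < 30:
        raise ValueError("objects must stay 30 days in Standard before moving to Standard-IA")
    if glacier_after_days < ia_after_days + 30:
        raise ValueError("within one rule, objects must stay 30 days in Standard-IA before Glacier")
    return {
        "Rules": [
            {
                "ID": "archive-then-expire",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # rule applies to a group of objects
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


# Assumed usage (requires boto3, credentials and an existing bucket):
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-bucket",
#       LifecycleConfiguration=build_lifecycle_configuration("logs/", 30, 60, 365),
#   )
```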
S3 Replication
- 2 types of replication are supported by S3:
- Cross-Region Replication (CRR)
- Same-Region Replication (SRR)
- Both types of replication support same account replication and cross-account replication
- If we configure cross-account replication, we have to define a policy on the destination account to allow replication from the source account
- We can replicate all objects from a bucket or we can create rules for a subset of objects
- We can specify which storage class to use for an object in the destination bucket
- We can also define the ownership of the objects in the destination bucket. By default it will be the same as the owner in the source bucket
- Replication Time Control (RTC): if enabled, objects are replicated within 15 minutes
- Replication consideration:
- Replication is not retroactive: only newer objects are replicated after the replication is enabled
- Versioning needs to be enabled for replication
- Replication is one-way only
- Replication is capable of handling objects encrypted with SSE-S3 and SSE-KMS. SSE-C is not supported for replication
- Replication requires that the owner of the source bucket has permissions on the objects which will be replicated
- System events will not be replicated
- Any objects in the Glacier and Glacier Deep Archive will not be replicated
- Deletions are not replicated by default!
- Replication use cases:
- SRR:
- Log aggregation
- PROD and Test sync
- Resilience with strict sovereignty
- CRR:
- Global resilience improvements
- Latency reduction
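The replication options above (subset rules, destination storage class, ownership override, RTC) can be sketched as a replication configuration. This is a minimal sketch under assumptions: the function name, rule ID and ARNs are hypothetical, and the dictionary follows the shape boto3's `put_bucket_replication` accepts as `ReplicationConfiguration`. Versioning must already be enabled on both buckets.

```python
def build_replication_configuration(role_arn, dest_bucket_arn, prefix="",
                                    storage_class="STANDARD_IA",
                                    dest_account_id=None, rtc=False):
    """Build a one-rule replication configuration for CRR or SRR."""
    destination = {"Bucket": dest_bucket_arn, "StorageClass": storage_class}
    if dest_account_id:
        # Cross-account: hand object ownership to the destination account.
        destination["Account"] = dest_account_id
        destination["AccessControlTranslation"] = {"Owner": "Destination"}
    if rtc:
        # Replication Time Control: 15 minute target, metrics are required with it.
        destination["ReplicationTime"] = {"Status": "Enabled", "Time": {"Minutes": 15}}
        destination["Metrics"] = {"Status": "Enabled", "EventThreshold": {"Minutes": 15}}
    return {
        "Role": role_arn,  # IAM role S3 assumes to replicate objects
        "Rules": [
            {
                "ID": "replicate-subset",  # hypothetical rule name
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # empty prefix = all objects
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": destination,
            }
        ],
    }


# Assumed usage (requires boto3 and versioned buckets):
#   boto3.client("s3").put_bucket_replication(
#       Bucket="source-bucket",
#       ReplicationConfiguration=build_replication_configuration(role, dest_arn),
#   )
```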
S3 Encryption
- Buckets aren’t encrypted, objects inside buckets are encrypted
- Encryption at rest types:
- Client-Side encryption: data is encrypted before it leaves the client
- Server-Side encryption: data is encrypted on the server side; it leaves the client in plaintext form
- Both encryption types use encryption in-transit for communication
- There are 3 types of server-side encryption supported:
- SSE-C: server-side encryption with customer-provided keys
- The customer is responsible for managing the keys, S3 performs the encryption
- When an object is put into S3, we need to provide the key utilized
- The object will be encrypted by the key, a hash is generated and stored for the key
- The key will be discarded after the encryption is done
- In case of object retrieval, we need to provide the key again
- SSE-S3: server-side encryption with Amazon S3-managed keys
- AWS handles both the encryption/decryption and the key management
- When using this method, S3 creates a master key for the encryption process (handled entirely by S3)
- When an object is uploaded, a unique key is used to encrypt it. That unique key is then encrypted with the master key, and the unencrypted copy of the key is discarded. Both the encrypted key and the encrypted object are stored
- For most situations, this is the default type of encryption. It uses the AES-256 algorithm, and the key management is handled entirely by S3
- SSE-KMS: Server-side encryption with customer-managed keys stored in AWS Key Management Service (KMS)
- Similar to SSE-S3, but for this method the KMS handles stored keys
- When an object is uploaded for the first time, S3 communicates with KMS and a customer master key (CMK) is created. This is the default master key used in the future
- When new objects are uploaded AWS uses the CMK to generate individual keys for encryption (data encryption keys). The data encryption key will be stored along with the object in encrypted format
- We don’t have to use the default CMK provided by AWS, we can use our own CMK. We can control the permission on it and how it is regulated
- SSE-KMS provides role separation:
- We can specify who can access the CMK from KMS
- Administrators can administer buckets without necessarily having access to the KMS keys
- Default Bucket Encryption:
- When an object is uploaded, we can specify which server-side encryption to be used by adding a header to the request: `x-amz-server-side-encryption`
- When this header is not specified, the default encryption method configured at the bucket level is used (since January 2023, S3 applies SSE-S3 to new objects when nothing else is configured)
- Values for the header:
- To use SSE-S3: `AES256`
- To use SSE-KMS: `aws:kms`
- Default encryption can not be used to restrict encryption type, we can use a bucket policy for that
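Since default encryption can not restrict the encryption type, a bucket policy has to do it. The following is a minimal sketch of such a policy built as a Python dictionary; the function name, the `Sid` and the bucket name are hypothetical. It denies `s3:PutObject` requests whose `x-amz-server-side-encryption` header does not carry the required value.

```python
import json


def require_encryption_policy(bucket, required="aws:kms"):
    """Bucket policy that rejects uploads not using the required SSE type."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyWrongEncryptionHeader",  # hypothetical statement ID
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                "Condition": {
                    "StringNotEquals": {"s3:x-amz-server-side-encryption": required}
                },
            }
        ],
    }


# The JSON document is what would be attached to the bucket, for example with
# boto3's put_bucket_policy(Bucket=..., Policy=policy_json) (assumed usage).
policy_json = json.dumps(require_encryption_policy("my-bucket"), indent=2)
```

Passing `required="AES256"` would enforce SSE-S3 instead of SSE-KMS.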
S3 Presigned URLs
- Presigned URLs are a way to give other people access to objects in our buckets using our credentials
- An IAM admin can generate a presigned URL for a specific object using their credentials. This URL will have an expiry date
- The presigned URL can be given to unauthenticated users in order to access the object
- The user will interact with S3 using the presigned URL as if it was the person who generated the presigned URL
- Presigned URLs can be used for downloads and for uploads
- Presigned URLs can be used to give an application user direct access to private files, offloading work from the application. This approach requires a service account for the application, which generates the presigned URLs
- Presigned URL considerations:
- We can create a presigned URL for objects we don’t have access to
- When using the URL, the permissions match the identity which generated it. The permissions are evaluated at the moment of accessing the object (if the identity has had its permissions revoked in the meantime, the URL will not give access to the object either)
- We should not generate presigned URLs with temporary credentials (for example, from an assumed IAM role). When the temporary credentials expire, the presigned URL stops working as well
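Generating presigned URLs for downloads and uploads can be sketched as below. This is a minimal sketch: the two helper names are hypothetical, and `s3_client` is assumed to be a boto3 S3 client created from long-lived credentials, whose `generate_presigned_url` method does the signing.

```python
def presign_download(s3_client, bucket, key, expires_in=3600):
    """Return a URL that lets anyone holding it GET the object until it expires."""
    return s3_client.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,  # lifetime in seconds
    )


def presign_upload(s3_client, bucket, key, expires_in=3600):
    """Return a URL that lets anyone holding it PUT an object at this key."""
    return s3_client.generate_presigned_url(
        "put_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,
    )


# Assumed usage (requires boto3 and credentials):
#   import boto3
#   url = presign_download(boto3.client("s3"), "my-bucket", "reports/q1.pdf", 900)
```

The URL is signed locally, so no permission check happens at generation time; S3 evaluates the signer's permissions only when the URL is used.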
S3 Select and Glacier Select
- Are ways to retrieve parts of objects instead of entire objects
- S3 can store huge objects (up to 5 TB)
- Retrieving a huge object takes time and consumes transfer capacity
- S3/Glacier provides services to access partial objects using SQL-like statements to select parts of objects
- Both S3 Select and Glacier Select support the following formats: CSV, JSON, Parquet, BZIP2 compression for CSV and JSON
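An S3 Select request pairs a SQL-like expression with a description of the input and output formats. The following is a minimal sketch that only builds the request arguments; the function name, bucket, key and column names are hypothetical, and the keys follow the parameters of boto3's `select_object_content`.

```python
def select_request(bucket, key, sql, compression="NONE"):
    """Build arguments for an S3 Select query over a CSV object with a header row."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": sql,
        "InputSerialization": {
            "CSV": {"FileHeaderInfo": "USE"},  # first line holds column names
            "CompressionType": compression,    # e.g. "NONE", "GZIP", "BZIP2"
        },
        "OutputSerialization": {"CSV": {}},
    }


# Only rows matching the WHERE clause leave S3, not the whole object.
request = select_request(
    "my-bucket",
    "data/customers.csv.bz2",
    "SELECT s.name FROM S3Object s WHERE s.country = 'DE'",
    compression="BZIP2",
)
# Assumed usage: boto3.client("s3").select_object_content(**request)
```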
S3 Access Points
- Access points improve the manageability of objects when a bucket is used by many different teams or contains objects for a large number of functions
- They simplify the process of managing access to S3 buckets/objects
- Rather than 1 bucket (1 bucket policy) access we can create many access points with different policies
- Each access point can be limited from where it can be accessed, and each can have different network access controls
- Each access point has its own endpoint address
- We can create an access point using the console or the CLI:
aws s3control create-access-point --name <name> --account-id <account-id> --bucket <bucket-name>
- Any permission defined on the access point needs to be defined on the bucket policy as well
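One common way to avoid duplicating every access point permission in the bucket policy is to delegate access control to the access points owned by the bucket owner's account. The following is a minimal sketch of such a bucket policy as a Python dictionary; the function name, bucket name and account ID are hypothetical, and the `s3:DataAccessPointAccount` condition key is what limits the grant to requests arriving through those access points.

```python
def delegate_to_access_points_policy(bucket, owner_account_id):
    """Bucket policy that defers access decisions to the account's access points."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": "*"},
                "Action": "*",
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                "Condition": {
                    # Only requests made through access points of this account match.
                    "StringEquals": {"s3:DataAccessPointAccount": owner_account_id}
                },
            }
        ],
    }
```

With this in place, the fine-grained rules live in the individual access point policies.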
S3 Block Public Access
- The Amazon S3 Block Public Access feature provides settings for access points, buckets, and accounts to help manage public access to Amazon S3 resources
- The settings we can configure with the Block Public Access Feature are:
- BlockPublicAcls: prevents new ACLs from being created, or existing ACLs from being modified, in a way that enables public access to the bucket or its objects. With this setting alone, existing public ACLs are not affected
- IgnorePublicAcls: any existing ACLs that grant public access are ignored. This does not prevent them from being created, but it prevents their effects
- BlockPublicPolicy: prevents a bucket policy that grants public access from being created or modified on an S3 bucket. An existing public policy remains in effect
- RestrictPublicBuckets: when a bucket has a public policy, access is restricted to AWS service principals and authorized users of the bucket owner's account (such as an IAM user or role)
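The four settings map directly onto four boolean flags. The following is a minimal sketch that builds the configuration; the function name is hypothetical, and the dictionary follows the shape boto3's `put_public_access_block` accepts as `PublicAccessBlockConfiguration`.

```python
def public_access_block(block_public_acls=True, ignore_public_acls=True,
                        block_public_policy=True, restrict_public_buckets=True):
    """Build a Block Public Access configuration; everything is on by default."""
    return {
        "BlockPublicAcls": block_public_acls,              # reject new public ACLs
        "IgnorePublicAcls": ignore_public_acls,            # neutralize existing public ACLs
        "BlockPublicPolicy": block_public_policy,          # reject new public bucket policies
        "RestrictPublicBuckets": restrict_public_buckets,  # limit access under public policies
    }


# Assumed usage (requires boto3):
#   boto3.client("s3").put_public_access_block(
#       Bucket="my-bucket",
#       PublicAccessBlockConfiguration=public_access_block(),
#   )
```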
S3 Cost Saving Options
- S3 Select and Glacier Select: save on network and CPU cost by retrieving only the necessary data
- S3 Lifecycle Rules: transition objects between tiers
- Compress objects to save space
- S3 Requester Pays:
- In general, bucket owners pay for all Amazon S3 storage and data transfer costs associated with their bucket
- With Requester Pays buckets, the requester instead of the bucket owner pays the cost of the request and the data download from the bucket
- The bucket owner always pays the cost of storing data
- Helpful when we want to share large datasets with other accounts
- Requires a bucket policy
- If an IAM role is assumed, the owner account of that role pays for the request!
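Requester Pays involves two sides: the owner switches the bucket over, and the requester acknowledges the charge on each request. The following is a minimal sketch; the two function names are hypothetical, and the dictionaries follow the parameters of boto3's `put_bucket_request_payment` and `get_object`.

```python
def requester_pays_configuration():
    """Body for put_bucket_request_payment: makes the requester pay."""
    return {"Payer": "Requester"}


def requester_pays_download(bucket, key):
    """Arguments for get_object; RequestPayer acknowledges the charge."""
    return {"Bucket": bucket, "Key": key, "RequestPayer": "requester"}


# Assumed usage (requires boto3):
#   s3.put_bucket_request_payment(
#       Bucket="shared-data",
#       RequestPaymentConfiguration=requester_pays_configuration(),
#   )
#   s3.get_object(**requester_pays_download("shared-data", "dataset.parquet"))
```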
S3 Object Lock
- Object Lock can be enabled on newly created S3 buckets. For existing ones in order to enable Object Lock we have to contact AWS support
- Versioning will be also enabled when Object Lock is enabled
- Object Lock can not be disabled, versioning can not be suspended when Object Lock is active
- Object Lock is a Write-Once-Read-Many (WORM) architecture: once an object is written, it can not be modified
- There are 2 ways S3 manages object retention:
- Retention Period
- Legal Hold
- An object version can have both a retention period and a legal hold enabled
- Object Lock retentions can be individually defined on object versions, a bucket can have default Object Lock settings
Retention Period
- When a retention period is enabled on an object, we specify the period in days or years
- The lock ends when the retention period expires
- There are 2 types of retention period modes:
- Compliance mode:
- Objects can not be adjusted, deleted or overwritten. The retention period can not be reduced and the retention mode can not be adjusted, even by the account root user
- Should be used for compliance reasons
- Governance mode:
- Objects can not be adjusted, deleted or overwritten, but special permissions can be added to some identities to allow for the lock setting to be adjusted
- These identities should have the `s3:BypassGovernanceRetention` permission
- The governance mode can be overridden by passing the `x-amz-bypass-governance-retention:true` header (the header is sent by default by the console UI)
Legal Hold
- We don’t set a retention period for this type of retention, Legal Hold can be on or off for specific versions of an object
- We can’t delete or overwrite an object with Legal Hold
- An extra permission is required when we want to add or remove the Legal Hold on an object: `s3:PutObjectLegalHold`
- Legal Hold can be used for preventing accidental removals
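Both retention types are applied per object version. The following is a minimal sketch that builds the request arguments; the function names are hypothetical, and the dictionaries follow the parameters of boto3's `put_object_retention` and `put_object_legal_hold`.

```python
from datetime import datetime, timedelta, timezone


def retention_request(bucket, key, version_id, mode, days,
                      now=None, bypass_governance=False):
    """Arguments for put_object_retention on one object version."""
    if mode not in ("GOVERNANCE", "COMPLIANCE"):
        raise ValueError("mode must be GOVERNANCE or COMPLIANCE")
    now = now or datetime.now(timezone.utc)
    kwargs = {
        "Bucket": bucket,
        "Key": key,
        "VersionId": version_id,
        "Retention": {"Mode": mode, "RetainUntilDate": now + timedelta(days=days)},
    }
    if bypass_governance:
        # Sends x-amz-bypass-governance-retention:true; the caller also needs
        # the s3:BypassGovernanceRetention permission.
        kwargs["BypassGovernanceRetention"] = True
    return kwargs


def legal_hold_request(bucket, key, version_id, on):
    """Arguments for put_object_legal_hold; needs s3:PutObjectLegalHold."""
    return {
        "Bucket": bucket,
        "Key": key,
        "VersionId": version_id,
        "LegalHold": {"Status": "ON" if on else "OFF"},
    }
```

A legal hold has no date attached: it stays in force until a request with status `OFF` removes it.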
Amazon Macie
- It is a data security and data privacy service
- Macie is a service to discover, monitor and protect data stored in S3 buckets
- Once enabled and pointed to buckets, Macie will automatically discover data and categorize it as PII, PHI, Finance etc.
- Macie uses data identifiers. There are 2 types of data identifiers:
- Managed Data Identifier: built-in, can use machine learning, pattern matching to analyze and discover data. It is designed to detect sensitive data from many countries
- Custom Data Identifier: created by clients, they are proprietary to accounts and they are regex based
- Discovery Jobs: these jobs use data identifiers to search for sensitive content. They generate findings which can be used for integration with other AWS services (e.g. EventBridge) in order to do automatic remediation
- Macie uses multi account architecture: one account is the master account which can manage other accounts to discover sensitive data
Macie Identifiers
- Data Discovery Jobs: analyze data in order to determine whether the objects contain sensitive data. This is done using data identifiers
- Managed Data Identifiers:
- Created and managed by AWS
- Can be used to detect a growing list of common sensitive data types: credentials, financial data, health data, personal identifiers (addresses, passports, etc.)
- Custom Data Identifiers:
- Can be created by us, AWS account users/owners
- They are using regex patterns to match data
- We can add optional keywords: character sequences that need to be in the proximity of the regex match
- Maximum Match Distance: how close the keywords must be to the regex match
- We can also include ignore words
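How the regex, keywords, maximum match distance and ignore words interact can be illustrated with a small local simulation. This is only an approximation for study purposes and not Macie's actual matching algorithm: the function name is hypothetical, and the sketch assumes that a keyword must appear within `max_match_distance` characters before the start of the regex match.

```python
import re


def find_sensitive(text, pattern, keywords=(), max_match_distance=50, ignore_words=()):
    """Return regex matches that have a keyword nearby and are not ignored."""
    results = []
    lowered = text.lower()
    for match in re.finditer(pattern, text):
        value = match.group(0)
        # Ignore words: drop matches that contain an excluded string.
        if any(word.lower() in value.lower() for word in ignore_words):
            continue
        if keywords:
            # Keywords: at least one must occur in the window before the match.
            start = max(0, match.start() - max_match_distance)
            window = lowered[start:match.start()]
            if not any(keyword.lower() in window for keyword in keywords):
                continue
        results.append(value)
    return results
```

For example, with the pattern `E-\d{6}` and the keyword `employee id`, an ID that directly follows the words "employee id" is reported, while the same pattern far away from any keyword is not.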
Macie Findings
- Macie will produce 2 types of findings:
- Policy Findings: are generated when the policies or settings are changed in a way that reduces the security of the bucket after Macie is enabled
- Sensitive Data Findings: generated when sensitive data is identified based on identifiers