AWS Service Reference Notes

The following notes cover the AWS services most commonly involved in PinPole optimisation work. They supplement the Engineering Notes available in each service's Node Configuration panel.


Lambda

Cold start, warm start, and the execution model

Lambda does not maintain a persistent server. When a request arrives and no warm execution environment is available, Lambda must provision one. This provisioning time — selecting a host, loading the runtime, initialising your function code — is the cold start.

Cold start duration depends on:

  • Runtime: Node.js (V8) typically has the shortest cold start, with Python close behind; Java has the longest of the major runtimes, because the JVM and application classes must load. For latency-sensitive workloads, prefer Node.js or Python.
  • Memory allocation: Higher memory allocations correlate with faster initialisation because more CPU is available during startup.
  • Deployment package size: Larger packages take longer to load. Keep dependencies minimal.

Under a Spike traffic pattern in PinPole, cold start behaviour is directly visible: Lambda latency will spike sharply before stabilising as warm instances fill the concurrency pool.
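One practical consequence of the execution model: anything initialised at module scope survives across warm invocations, so expensive setup should live outside the handler. A minimal sketch in plain Python (no AWS dependencies; `init_client` is a hypothetical stand-in for an expensive SDK client constructor):

```python
# Module scope runs once per execution environment (during the cold
# start), so expensive setup done here is reused by every warm invocation.
INIT_COUNT = 0

def init_client():
    """Stand-in for an expensive constructor, e.g. a database client."""
    global INIT_COUNT
    INIT_COUNT += 1
    return {"connected": True}

CLIENT = init_client()  # paid once, at cold start

def handler(event, context=None):
    # The handler runs on every invocation and reuses the warm CLIENT.
    return {"connected": CLIENT["connected"], "inits": INIT_COUNT}

# Two invocations of the same warm environment: initialisation ran once.
first = handler({})
second = handler({})
```

The same pattern is why database connections, SDK clients, and loaded models belong at module scope rather than inside the handler body.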

Provisioned vs. reserved concurrency

These are two distinct controls that are frequently confused:

  • Reserved concurrency: sets the maximum number of concurrent instances for a function. Acts as a hard cap — requests beyond this limit are throttled — and reserves that capacity from the account pool. Use it when you need to guarantee a function cannot consume more than a defined share of account concurrency, or when you need to protect downstream services from being overwhelmed.
  • Provisioned concurrency: pre-initialises a specific number of execution environments, keeping them warm and ready to respond with no cold start delay; it counts against the function's reserved concurrency where one is set. Use it when cold start latency is unacceptable — typically for user-facing APIs under spike or burst load.

Account concurrency limit

The default Lambda concurrency limit per region per AWS account is 1,000 concurrent executions. Of this, up to 900 can be reserved across all functions (100 is held as unreserved headroom). Provisioned concurrency for any single function is bounded by its reserved concurrency allocation.

If you have two Lambda functions and want to give each 1,000 provisioned concurrency instances, you cannot — the total available is 900 across all functions in the account. If your architecture genuinely requires more than 1,000 concurrent executions, request a quota increase from AWS.
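The two controls are configured through two separate API calls. A sketch with the AWS CLI (the function name, alias, and numbers are placeholders; provisioned concurrency is set on a published version or alias, not on $LATEST):

```shell
# Hard cap: this function can never exceed 100 concurrent instances;
# the 100 are reserved from the account pool, and excess requests throttle.
aws lambda put-function-concurrency \
  --function-name my-api-fn \
  --reserved-concurrent-executions 100

# Keep 20 execution environments pre-initialised (no cold start) on the
# "live" alias. Must not exceed the reserved allocation above.
aws lambda put-provisioned-concurrency-config \
  --function-name my-api-fn \
  --qualifier live \
  --provisioned-concurrent-executions 20
```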

Push vs. pull invocation models

Lambda can receive events in two distinct ways, and the connection type on the canvas reflects which model is in use:

  • Push model: The event source (e.g. API Gateway, SNS) invokes Lambda directly and synchronously. Lambda must respond within the invocation timeout. High-throughput push invocations hit Lambda's concurrency limit fastest.
  • Pull model: Lambda polls a source (e.g. SQS, Kinesis, DynamoDB Streams) and processes batches of messages. Lambda controls the consumption rate. This model naturally smooths traffic spikes.

When SQS is placed before Lambda, the connection uses the pull model. Lambda polls the queue and processes messages in batches — this is what makes SQS an effective traffic buffer.
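In AWS terms, the pull model is an event source mapping: the mapping owns the pollers, not your function. A sketch with the AWS CLI (queue ARN and function name are placeholders):

```shell
# Lambda polls the queue on your behalf and invokes the function with
# batches of up to 10 messages; a non-zero batching window lets the poller
# wait briefly to fill batches, which smooths spikes further.
aws lambda create-event-source-mapping \
  --function-name my-worker-fn \
  --event-source-arn arn:aws:sqs:eu-west-1:123456789012:my-queue \
  --batch-size 10 \
  --maximum-batching-window-in-seconds 5
```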


DynamoDB

Partition key design and hot partitions

DynamoDB distributes data across internal partitions using a hash of the partition key. Each partition sustains approximately 3,000 read capacity units (RCUs) and 1,000 write capacity units (WCUs) per second. If too many requests target items with the same partition key, all those requests hit a single partition — this is a hot partition, and it will throttle even if your total provisioned capacity is sufficient.

Principles for avoiding hot partitions:

  • Choose a high-cardinality partition key. User ID and email address are good choices — they distribute load evenly across many partitions.
  • Avoid status fields (e.g. pending, active, cancelled) as partition keys — they have very few distinct values, so most traffic concentrates on a handful of partitions.
  • Avoid keys that encode time at coarse granularity (e.g. a date-only field). All writes on a given day will hit the same partition.
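The cardinality argument above can be made concrete with a toy model: hash each key to pick one of a fixed set of partitions, the same principle DynamoDB applies internally (plain Python; the partition count and hash function are illustrative, not DynamoDB's real scheme):

```python
import hashlib
from collections import Counter

def partition_for(key: str, partitions: int = 8) -> int:
    """Pick a partition by hashing the partition key."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % partitions

# Low cardinality: a three-value status field funnels 3,000 requests into
# at most three partitions, however much total capacity is provisioned.
status_hits = Counter(
    partition_for(status) for status in ["pending", "active", "cancelled"] * 1000
)

# High cardinality: per-user keys spread the same 3,000 requests across
# every partition.
user_hits = Counter(partition_for(f"user-{i}") for i in range(3000))
```

With the low-cardinality key, each of those few partitions absorbs roughly a third of all traffic; with per-user keys the load lands on all eight partitions.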

Check partition key design before simulating

PinPole's DynamoDB node configuration surfaces the partition key and sort key settings. Before running your first simulation, check whether your access patterns are consistent with the key design. If the primary access pattern does not align with your partition key, you will need either a GSI or a key redesign.

Global Secondary Indexes (GSIs)

A GSI creates an alternate partition key on the same table, allowing queries on non-primary attributes without a full table scan. DynamoDB supports up to 20 GSIs per table by default (a quota that can be raised). Each GSI has its own partition key, and the same hot-partition rules apply — a poorly designed GSI partition key can throttle independently of the main table.
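As a sketch of how a GSI is declared, here is a hypothetical orders table with a high-cardinality GSI key (all names are placeholders; with on-demand billing, no separate GSI throughput needs to be specified):

```shell
# An "orders" table keyed by user. A GSI keyed by order status would
# recreate the hot-partition problem, so this one uses a high-cardinality
# merchant ID instead.
aws dynamodb create-table \
  --table-name orders \
  --attribute-definitions \
      AttributeName=user_id,AttributeType=S \
      AttributeName=merchant_id,AttributeType=S \
  --key-schema AttributeName=user_id,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST \
  --global-secondary-indexes \
      'IndexName=by-merchant,KeySchema=[{AttributeName=merchant_id,KeyType=HASH}],Projection={ProjectionType=ALL}'
```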

DAX (DynamoDB Accelerator)

DAX is an in-memory cache that sits between your application and DynamoDB. It is most valuable for read-heavy workloads with repeated access to the same items. At high RPS — such as a recommendation service serving popular titles — the same DynamoDB items may be read thousands of times per second. Without DAX, each read consumes RCUs and adds DynamoDB latency. With DAX, repeated reads are served from the cache at microsecond latency with no RCU consumption.

Enable DAX when your workload is read-heavy and your data has reasonable temporal locality (recently accessed items are likely to be accessed again soon).
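The read-through behaviour DAX provides can be sketched without the real DAX client; this is only the caching pattern, with a plain dict standing in for the DynamoDB table:

```python
import time

class ReadThroughCache:
    """Toy read-through cache with TTL, mimicking DAX item-cache semantics."""

    def __init__(self, table: dict, ttl_seconds: float = 300.0):
        self.table = table          # stand-in for DynamoDB
        self.ttl = ttl_seconds
        self.cache = {}             # key -> (value, expiry)
        self.backend_reads = 0      # reads that would consume RCUs

    def get(self, key):
        hit = self.cache.get(key)
        if hit and hit[1] > time.monotonic():
            return hit[0]           # served from cache: no RCU cost
        self.backend_reads += 1     # cache miss: read the table
        value = self.table[key]
        self.cache[key] = (value, time.monotonic() + self.ttl)
        return value

# A hot item read 1,000 times costs a single backend read.
store = ReadThroughCache({"title-42": {"name": "Popular Title"}})
for _ in range(1000):
    store.get("title-42")
```

This is the temporal-locality argument in miniature: the higher the repeat-read rate within the TTL, the larger the fraction of traffic that never touches DynamoDB.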

On-demand vs. provisioned capacity

  • On-demand: scales automatically to any traffic level. Higher cost per RCU/WCU, but no idle waste. Best for unpredictable or spiky traffic, and for early-stage design where access patterns are not yet known.
  • Provisioned: fixed RCU/WCU allocation, with auto-scaling available as an add-on. Lower cost when tuned correctly, but a risk of throttling if under-provisioned. Best for predictable, stable workloads with well-understood access patterns.

During PinPole simulation sessions, start with on-demand to avoid unexpected throttling signals that obscure other bottlenecks. Once your architecture is stable and your access patterns are clear, switch to provisioned and use the simulation to tune the capacity allocation.
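Switching modes is a single table update. A sketch with the AWS CLI (table name and capacity numbers are placeholders; note that AWS limits billing-mode switches to once per 24 hours per table):

```shell
# During early simulations: on-demand, so throttling never masks other
# bottlenecks.
aws dynamodb update-table \
  --table-name orders \
  --billing-mode PAY_PER_REQUEST

# Once access patterns are stable: pin capacity and tune it via simulation.
aws dynamodb update-table \
  --table-name orders \
  --billing-mode PROVISIONED \
  --provisioned-throughput ReadCapacityUnits=2000,WriteCapacityUnits=500
```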


API Gateway

Throttling and request limits

API Gateway applies throttling at two levels: account-level defaults and per-stage/per-route rate limits. The default account-level limit varies by region but is typically 10,000 RPS steady-state with a burst of 5,000.

At very high RPS (100K+), API Gateway will throttle even with a well-tuned architecture. The correct response at that scale is architectural — reduce the synchronous load that API Gateway must carry — rather than purely configurational. See Pattern 6.
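The two numbers describe a token bucket: the steady-state rate refills the bucket and the burst limit is its size. A toy model in plain Python (an illustration of the throttling shape, not API Gateway's actual implementation):

```python
class TokenBucket:
    """Steady-state rate plus a burst allowance, the shape of API
    Gateway's stage throttling."""

    def __init__(self, rate_per_s: float, burst: int):
        self.rate = rate_per_s      # tokens added per second
        self.burst = burst          # bucket size = burst limit
        self.tokens = float(burst)
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill at the steady-state rate, capped at the burst size.
        self.tokens = min(float(self.burst),
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False                # throttled: the caller sees HTTP 429

# 6,000 requests landing at the same instant: the burst absorbs 5,000
# and the rest are throttled until the bucket refills.
bucket = TokenBucket(rate_per_s=10_000, burst=5_000)
accepted = sum(bucket.allow(now=0.0) for _ in range(6_000))
```

This is why a perfectly flat 10,000 RPS stream passes untouched while a sharper spike of the same average rate sees 429s: the burst bucket, not the steady-state rate, is what a spike exhausts first.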

API Gateway caching

API Gateway supports response caching at the stage level, with TTL configurable per route. For recommendation APIs or other read-heavy endpoints where the same response is valid for multiple users over a short window, enabling API Gateway caching can dramatically reduce Lambda invocations and cost. Combine with CloudFront for maximum cache hit rate.
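Stage caching is a deployment-time setting. A sketch with the AWS CLI (API ID and stage name are placeholders; the `/*/*` path applies the TTL to every method on the stage):

```shell
# Enable a 0.5 GB stage cache and serve identical requests from it for
# 60 seconds, skipping the Lambda invocation entirely on a cache hit.
aws apigateway update-stage \
  --rest-api-id a1b2c3d4e5 \
  --stage-name prod \
  --patch-operations \
      op=replace,path=/cacheClusterEnabled,value=true \
      op=replace,path=/cacheClusterSize,value=0.5 \
      'op=replace,path=/*/*/caching/ttlInSeconds,value=60'
```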


SQS

Visibility timeout

The visibility timeout controls how long a message is hidden from other consumers while being processed by a Lambda function. If Lambda does not delete the message before the timeout expires, it becomes visible again and may be processed a second time.

Visibility timeout rule

Set the visibility timeout comfortably above your Lambda function's maximum execution time, and never below it. For Lambda event source mappings, AWS's guidance is at least six times the function timeout, leaving room for batch windows and retries: with a 30-second Lambda timeout, set the SQS visibility timeout to at least 180 seconds.
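Applied with the AWS CLI (the queue URL is a placeholder; 180 seconds here follows AWS's six-times-the-function-timeout guidance for a 30-second function):

```shell
# Messages stay invisible for 180s after a consumer receives them, long
# enough for a 30s function plus retries before any redelivery.
aws sqs set-queue-attributes \
  --queue-url https://sqs.eu-west-1.amazonaws.com/123456789012/my-queue \
  --attributes VisibilityTimeout=180
```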

Standard vs. FIFO queues

  • Standard: best-effort ordering (not guaranteed), nearly unlimited throughput. Use for most workloads where order does not matter.
  • FIFO: strict first-in-first-out ordering, up to 3,000 messages per second with batching. Use when strict ordering is required (e.g. financial transactions, ordered event streams).

For most recommendation, analytics, and processing workloads, standard queues are appropriate. FIFO queues add cost and reduce throughput — use them only when ordering is a hard requirement.


CloudFront

Cache modes

CloudFront's cache behaviour in PinPole is configurable via the cache mode setting:

  • Aggressive: long TTLs, maximum cache hit rate. Best for static content or slowly changing data.
  • Balanced: moderate TTLs. Best for most API responses where data changes over minutes to hours.
  • Minimum: short TTLs, low cache hit rate. Best when data freshness is critical.

Lambda@Edge

Lambda@Edge allows you to run Lambda functions at CloudFront edge locations, enabling request/response modification, authentication, and routing logic at the edge without a round trip to the origin. In PinPole, Lambda@Edge is available as a separate node that connects to a CloudFront distribution.

When combined with a Next.js or similar SSR application hosted on Lambda, this creates an edge-hosted architecture where CloudFront serves cached responses and Lambda@Edge handles dynamic rendering — a highly scalable pattern for content-heavy applications.