DeathStarBench: Microservices Benchmark

Brett Clancy, Chris Colen, Fukang Wen, Catherine Leung, Siyuan Wang,
Leon Zaruvinsky, Mateo Espinosa, Rick Lin, Zhongling Liu, Jake Padilla, Christina Delimitrou
Cornell University
bjc265@[Link], cdc99@[Link], fw224@[Link], chl66@[Link], sw884@[Link],
laz37@[Link], me326@[Link], cl2545@[Link], zl682@[Link], jsp264@[Link], delimitrou@[Link]
Abstract
Cloud services have recently started undergoing a major shift from monolithic applications, to graphs of hundreds of loosely-coupled microservices. Microservices fundamentally change a lot of assumptions current cloud systems are designed with, and present both opportunities and challenges when optimizing for quality of service (QoS) and utilization. In this paper we explore the implications microservices have across the cloud system stack. We first present DeathStarBench, a novel, open-source benchmark suite built with microservices that is representative of large end-to-end services, modular and extensible. DeathStarBench includes a social network, a media service, an e-commerce site, a banking system, and IoT applications for coordination control of UAV swarms. We then use DeathStarBench to study the architectural characteristics of microservices, their implications in networking and operating systems, their challenges with respect to cluster management, and their trade-offs in terms of application design and programming frameworks. Finally, we explore the tail at scale effects of microservices in real deployments with hundreds of users, and highlight the increased pressure they put on performance predictability.

CCS Concepts • Computer systems organization → Cloud computing; • Software and its engineering → n-tier architectures; Cloud computing.
Keywords cloud computing, datacenters, microservices, cluster management, serverless, acceleration, fpga, QoS

ACM Reference Format:
Yu Gan, Y. Zhang, D. Cheng, A. Shetty, P. Rathi, N. Katarki, A. Bruno, J. Hu, B. Ritchken, B. Jackson, K. Hu, M. Pancholi, Y. He, B. Clancy, C. Colen, F. Wen, C. Leung, S. Wang, L. Zaruvinsky, M. Espinosa, R. Lin, Z. Liu, J. Padilla, and C. Delimitrou. 2019. An Open-Source Benchmark Suite for Microservices and Their Hardware-Software Implications for Cloud & Edge Systems. In Proceedings of 2019 Architectural Support for Programming Languages and Operating Systems (ASPLOS'19). ACM, New York, NY, USA, 16 pages. https://[Link]/10.1145/3297858.3304013

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@[Link].
ASPLOS'19, April 13–17, 2019, Providence, RI, USA
© 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-6240-5/19/04...$15.00
[Link]
Session: Cloud I ASPLOS’19, April 13–17, 2019, Providence, RI, USA
vices becomes, the more the modular design of microservices helps manage system complexity. They similarly facilitate deploying, scaling, and updating individual microservices independently, avoiding long development cycles, and improving elasticity. Fig. 1 shows the deployment differences between a traditional monolithic service, and an application built with microservices. While the entire monolith is scaled out on multiple servers, microservices allow individual components of the end-to-end application to be elastically scaled, with microservices of complementary resources bin-packed on the same physical server. Even though modularity in cloud services was already part of the Service-Oriented Architecture (SOA) design approach [77], the fine granularity of microservices, and their independent deployment create hardware and software challenges different from those in traditional SOA workloads.

Second, microservices enable programming language and framework heterogeneity, with each tier developed in the most suitable language, only requiring a common API for microservices to communicate with each other; typically over

Figure 2. Exploring the implications of microservices across the system stack.

source applications, such as NGINX [13], memcached [39], MongoDB [12], Cylon [5], and Xapian [51]. To create the end-to-end services, we built custom RPC and RESTful APIs using popular open-source frameworks like Apache Thrift [1], and gRPC [9]. Finally, to track how user requests progress through microservices, we have developed a lightweight and transparent-to-the-user distributed tracing system, similar to Dapper [76] and Zipkin [17], that tracks requests at RPC granularity, associates RPCs belonging to the same end-to-end request, and records traces in a centralized database. We study both traffic generated by real users of the services, and synthetic loads generated by open-loop workload generators.

We use these services to study the implications of microservices spanning the system stack, as seen in Fig. 2. First, we quantify how effective current datacenter architectures are at

1 Named after the DeathStar graphs that visualize dependencies between microservices [18, 19].
running microservices, as well as how datacenter hardware needs to change to better accommodate their performance and resource requirements (Section 4). This includes analyzing the cycle breakdown in modern servers, examining whether big or small cores are preferable [25, 35, 41, 42, 46–48], determining the pressure microservices put on instruction caches [37, 52], and exploring the potential they have for hardware acceleration [24, 27, 38, 49, 71]. We show that despite the small amount of computation per microservice, the latency requirements of each individual tier are much stricter than for typical applications, putting more pressure on predictably high single-thread performance.

Second, we quantify the networking and operating system implications of microservices. Specifically we show that, similarly to traditional cloud applications, microservices spend a large fraction of time in the kernel. Unlike monolithic services though, microservices spend much more time sending and processing network requests over RPCs or other REST APIs. Fig. 3 shows the breakdown of execution time to network (red) and application processing (green) for three monolithic services (NGINX, memcached, MongoDB) and the end-to-end Social Network application. While for the single-tier services only a small amount of time goes towards network processing, when using microservices, this time increases to 36.3% of total execution time, causing the system's resource bottlenecks to change drastically. In Section 5 we show that offloading RPC processing to an FPGA tightly-coupled with the host server can improve network performance by 10-60×.

Figure 3. Network (red) vs. application processing (green) for monoliths and microservices. (Network shares of execution time: NGINX, Lat=1293usec, 5.3%; memcached, Lat=186usec, 19.8%; MongoDB, Lat=383usec, 13.6%; Social Network, Lat=3827usec, 36.3%.)

Third, microservices significantly complicate cluster management. Even though the cluster manager can scale out individual microservices on-demand instead of the entire monolith, dependencies between microservices introduce backpressure effects and cascading QoS violations that quickly propagate through the system, making performance unpredictable. Existing cluster managers that optimize for performance and/or utilization [29, 32, 33, 36, 45, 60–62, 64, 66–68, 73, 80, 84] are not expressive enough to account for the impact each pair-wise dependency has on end-to-end performance. In Section 6, we show that mismanaging even a single such dependency dramatically hurts tail latency, e.g., by 10.4× for the Social Network, and requires long periods for the system to recover, compared to the corresponding monolithic service. We also show that traditional autoscaling mechanisms, present in many cloud infrastructures, fall short of addressing QoS violations caused by dependencies between microservices.

Fourth, in Section 7, we identify microservices creating bottlenecks in the end-to-end service's critical path, quantify the performance trade-offs between RPC and RESTful APIs, and explore the performance and cost implications of running microservices on serverless programming frameworks.

Finally, given that performance issues in the cloud often only emerge at large scale [28], in Section 8 we use real application deployments with hundreds of users to show that tail-at-scale effects become more pronounced in microservices compared to monolithic applications, as a single poorly-configured microservice, or slow server can degrade end-to-end latency by several orders of magnitude.

As microservices continue to evolve, it is essential for datacenter hardware, operating and networking systems, cluster managers, and programming frameworks to also evolve with them, to ensure that their prevalence does not come at a performance and/or efficiency loss. DeathStarBench is currently used in several academic and industrial institutions with applications in serverless compute, hardware acceleration, and runtime management. We hope that open-sourcing it to a wider audience will encourage more research in this emerging field.

2 Related Work
Cloud applications have attracted a lot of attention over the past decade, with several benchmark suites being released both from academia and industry [37, 44, 51, 81, 88]. Cloudsuite for example, includes both batch and interactive services, such as memcached, and has been used to study the architectural implications of cloud benchmarks [37]. Similarly, TailBench aggregates a set of interactive benchmarks, from web servers and databases to speech recognition and machine translation systems, and proposes a new methodology to analyze their performance [51]. Sirius also focuses on intelligent personal assistant workloads, such as voice to text translation, and has been used to study the acceleration potential for interactive ML applications [44].

A limitation of these benchmark suites is that they focus on single-tier applications, or at most services with two or three tiers, which drastically deviates from the way cloud services are deployed today. For example, even applications like websearch, which is a classic multi-tier workload, are configured as independent leaf nodes, which does not capture correlations across tiers. As we show in Sections 4-7, studying the effects of microservices using existing benchmarks leads to fundamentally different conclusions altogether.

The emergence of microservices has prompted recent work to study their characteristics and requirements [55, 78, 79, 86]. µSuite for example quantifies the system call, context
Service | Total LoCs | Comm. Protocol | Handwritten / Autogen LoCs for RPC/REST | Unique Microservices | Per-language LoC breakdown (end-to-end service)
Social Network | 15,198 | RPC | 9,286 / 52,863 | 36 | 34% C, 23% C++, 18% Java, 7% [Link], 6% Python, 5% Scala, 3% PHP, 2% Javascript, 2% Go
Movie Reviewing | 12,155 | RPC | 9,853 / 48,001 | 38 | 30% C, 21% C++, 20% Java, 10% PHP, 8% Scala, 5% [Link], 3% Python, 3% Javascript
E-commerce Website | 16,194 | REST + RPC | REST: 4,798 / –; RPC: 2,658 / 12,085 | 41 | 21% Java, 16% C++, 15% C, 14% Go, 10% Javascript, 7% [Link], 5% Scala, 4% HTML, 3% Ruby
Banking System | 13,876 | RPC | 4,757 / 31,156 | 34 | 29% C, 25% Javascript, 16% Java, 16% [Link], 11% C++, 3% Python
Swarm Cloud | 11,283 | REST + RPC | REST: 2,610 / –; RPC: 4,614 / 21,574 | 25 | 36% C, 19% Java, 16% Javascript, 14% [Link], 13% C++, 2% Python
Swarm Edge | 13,876 | REST | 4,757 / – | 21 | 29% C, 25% Javascript, 16% Java, 16% [Link], 11% C++, 3% Python

Table 1. Characteristics and code composition of each end-to-end microservices-based application.
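The Handwritten vs. Autogen split in Table 1 reflects how RPC toolchains such as Thrift and gRPC divide the work: developers hand-write the service handlers and a small interface definition, while the toolchain generates the client/server stub code. The following is a minimal Python sketch of that division of labor; the class and method names are hypothetical illustrations, not the suite's actual API, and real stubs are generated from an IDL and serialize calls over the network rather than forwarding in-process.

```python
class UserServiceHandler:
    """The 'handwritten' part: application logic behind one microservice."""
    def follow(self, follower_id: int, followee_id: int) -> bool:
        # A real handler would update backend state (memcached/MongoDB).
        return follower_id != followee_id


class RpcProxy:
    """Stands in for the 'autogenerated' part: a client stub that forwards
    method calls to a handler (in-process here, for simplicity)."""
    def __init__(self, handler):
        self._handler = handler

    def __getattr__(self, method):
        def call(*args, **kwargs):
            # Generated stubs would serialize args, send them over TCP,
            # and deserialize the response; here we dispatch directly.
            return getattr(self._handler, method)(*args, **kwargs)
        return call


# A caller only sees the stub's interface, never the handler internals.
client = RpcProxy(UserServiceHandler())
ok = client.follow(1, 2)  # forwarded through the stub to the handler
```

This also illustrates why autogenerated LoCs dwarf handwritten ones in Table 1: the serialization and transport machinery the toolchain emits is much larger than the interface it is derived from.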
switch, and other OS overheads in microservices [78], while Ueda et al. [79] show the impact of compute resource allocation, application framework, and container configuration on the performance and scalability of several microservices. DeathStarBench differentiates from these studies by focusing on large-scale applications with tens of unique microservices, allowing us to study effects that only emerge at large scale, such as network contention and cascading QoS violations due to dependencies between tiers, as well as by including diverse applications that span social networks, media and e-commerce services, and applications running on swarms of edge devices.

3 The DeathStarBench Suite
We first describe the suite's design principles, and then present the architecture and functionality of each end-to-end service.

3.1 Design Principles
DeathStarBench adheres to the following design principles:
• Representativeness: The suite is built using popular open-source applications deployed by cloud providers, such as NGINX [13], memcached [39], MongoDB [12], RabbitMQ [15], MySQL, Apache http server, ardrone-autonomy [2, 5], and the Sockshop microservices by Weave [16]. Most new code corresponds to interfaces between the services, using Apache Thrift [1], gRPC [9], or http requests.
• End-to-end operation: Open-source cloud services, such as memcached, can function as components of a larger service, but do not capture the impact of inter-service dependencies on end-to-end performance. DeathStarBench instead implements the full functionality of a service from the moment a request is generated at the client until it reaches the service's backend and/or returns to the client.
• Heterogeneity: The software heterogeneity is both a challenge and opportunity with microservices, as different languages mean different bottlenecks, synchronization primitives, levels of indirection, and development effort. The suite uses applications in low- and high-level, managed and unmanaged languages including C/C++, Java, Javascript, [Link], Python, html, Ruby, Go, and Scala.
• Modularity: We follow Conway's Law [4], i.e., the fact that the software architecture of a service follows the architecture of the company that built it, in the design of the end-to-end applications, to avoid excessive two-way communication between any two dependent microservices, and to ensure they are single-concerned and loosely-coupled.
• Reconfigurability: Easily updating components of a larger service is one of the main advantages of microservices. Our RPC/HTTP API allows swapping out microservices for alternate versions, with small changes to existing components.

Table 1 shows the developed LoCs per service, and the LoCs for the communication protocol; hand-written, and auto-generated by Thrift, where applicable. The majority of new code for the Social Network, Media, E-commerce, and Banking services goes towards the cross-microservice API, as well as a few microservices for which no open-source framework existed, e.g., assigning ratings to movies. For the Swarm application, we show the code breakdown for two versions; one where the majority of computation happens in a backend cloud (Swarm Cloud), and one where it happens locally on the edge devices (Swarm Edge). We also show the number of unique microservices for each application, and the breakdown per programming language. Unless otherwise noted, all microservices run in Docker containers.

3.2 Social Network
Scope: The end-to-end service implements a broadcast-style social network with uni-directional follow relationships.
Functionality: Fig. 4 shows the architecture of the end-to-end service. Users (clients) send requests over http, which first reach a load balancer, implemented with nginx. Once a
Figure 4. The architecture (microservices dependency graph) of Social Network.
Figure 5. The architecture of the Media Service for reviewing, renting, and streaming movies.

specific webserver is selected, also in nginx, the latter uses a php-fpm module to talk to the microservices responsible for composing and displaying posts, as well as microservices for advertisements, search engines, etc. All messages downstream of php-fpm are Apache Thrift RPCs [1]. Users can create posts embedded with text, media, links, and tags to other users. Their posts are then broadcasted to all their followers. Users can also read, favorite, and repost posts, as well as reply publicly, or send a direct message to another user. The application also includes machine learning plugins, such as ads and user recommender engines [22, 23, 53, 83], a search service using Xapian [51], and microservices to record and display user statistics, e.g., number of followers, and to allow users to follow, unfollow, or block other accounts. The service's backend uses memcached for caching, and MongoDB for persistent storage for posts, profiles, media, and recommendations. Finally, the service is instrumented with a distributed tracing system (Sec. 3.7), which records the latency of each network request and per-microservice processing; traces are recorded in a centralized database. The service is broadly deployed at our institution, currently servicing several hundred users. We use this deployment to quantify the tail at scale effects of microservices in Section 8.

3.3 Media Service
Scope: The application implements an end-to-end service for browsing movie information, as well as reviewing, rating, renting, and streaming movies [18, 19].
Functionality: Fig. 5 shows the architecture of the end-to-end service. As with the social network, a client request hits the load balancer, which distributes requests among multiple nginx webservers. Users can search and browse information about movies, including their plot, photos, videos, cast, and review information, as well as insert new reviews in the system for a specific movie by logging into their account. Users can also select to rent a movie, which involves a payment authentication module to verify that the user has enough funds, and a video streaming module using nginx-hls, a production nginx module for HTTP live streaming. The actual movie files are stored in NFS, to avoid the latency and complexity of accessing chunked records from non-relational databases, while movie reviews are kept in memcached and MongoDB instances. Movie information is maintained in a sharded and replicated MySQL database. The application also includes movie and advertisement recommenders, as well as a couple of auxiliary services for maintenance and service discovery, which are not shown in the figure. We are similarly deploying Media Service as a hosting site for project demos at Cornell, which members of the community can browse and review.

3.4 E-Commerce Service
Scope: The service implements an e-commerce site for clothing. The design draws inspiration from, and uses several components of, the open-source Sockshop application [16].
Functionality: Fig. 6 shows the architecture of the end-to-end service. The application front-end in this case is a [Link] service. Clients can use the service to browse the inventory using catalogue, a Go microservice that mines the back-end memcached and MongoDB instances holding information about products. Users can also place orders (Go) by adding items to their cart (Java). After they log in (Go) to their account, they can select shipping options (Java), process their payment (Go), and obtain an invoice (Java) for their order. Orders are serialized and committed using QueueMaster (Go). Finally, the service includes a recommender engine for suggested products, and microservices for creating an item wishlist (Java), and displaying current discounts.

3.5 Banking System
Scope: The service implements a secure banking system, which users leverage to process payments, request loans, or balance their credit card.
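The order-placement path described for the E-commerce service (cart, login, shipping, payment, invoice, each a separate single-concern microservice) can be sketched as a chain of calls. The function names and payload fields below are illustrative stand-ins for the suite's actual Go/Java microservices, not its real API; each step would be a separate RPC in the deployed service.

```python
# Hypothetical sketch of the E-commerce order path. In the real service each
# function is a separate microservice reached over RPC/REST, so end-to-end
# latency is the sum of every hop plus queueing between tiers.

def cart_add(order, item):        # stands in for the cart service (Java)
    order.setdefault("items", []).append(item)
    return order

def login(order, user):           # stands in for the login service (Go)
    order["user"] = user
    return order

def shipping(order, option):      # stands in for shipping options (Java)
    order["shipping"] = option
    return order

def payment(order):               # stands in for payment processing (Go)
    order["paid"] = True
    return order

def invoice(order):               # stands in for the invoice service (Java)
    return {"user": order["user"], "total_items": len(order["items"]),
            "shipped_via": order["shipping"], "paid": order["paid"]}

# One end-to-end "place order" request traverses the whole chain.
result = invoice(payment(shipping(login(cart_add({}, "socks"), "alice"), "2-day")))
```

The chain also makes concrete why placing an order is orders of magnitude slower than a single catalogue lookup: a QoS violation in any one tier propagates to the end-to-end latency of the whole request.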
Figure 6. The architecture of the E-commerce service.
Figure 7. The architecture of the Banking end-to-end service.
Figure 8. The Swarm service running (a) on edge devices, and (b) on the cloud. (c) Local drone swarm executing the service.

Functionality: Users interface with a [Link] front-end, similar to the one in the E-commerce service, to login to their account, search information about the bank, or contact a representative. Once logged in, a user can process a payment from their account, pay their credit card or request a new one, browse information about loans or request one, and obtain information about wealth management options. Most microservices are written in Java and Javascript. The back-end databases consist of in-memory memcached, and persistent MongoDB instances. The service also has a relational database (BankInfoDB) that includes information about the bank, its services, and representatives.

3.6 Swarm Coordination
Scope: Finally, we explore a different execution environment for microservices, where applications run both on the cloud and on edge devices. The service coordinates the routing of a swarm of programmable drones, which perform image recognition and obstacle avoidance.
Functionality: We explore two versions of this service. In the first (Fig. 8a), the majority of the computation happens on the drones, including the motion planning, image recognition, and obstacle avoidance, with the cloud only constructing the initial route per-drone (Java service ConstructRoute), and holding persistent copies of sensor data. This architecture avoids the high network latency between cloud and edge, however, it is limited by the on-board resources. The Controller and MotionController are implemented in Javascript, while ImageRecognition uses jimp, a [Link] library for image recognition [11], and ObstacleAvoidance is in C++. Services on the drones run natively, and communicate with each other over IPC, while the cloud and drones communicate over http to avoid installing the heavy dependencies of Thrift on the edge devices.

In the second version (Fig. 8b), the cloud is responsible for most of the computation. It performs motion control, image recognition, and obstacle avoidance for all drones, using the ardrone-autonomy [2], and Cylon [5] libraries, in OpenCV and Javascript respectively. The edge devices are only responsible for collecting sensor data and transmitting them to the cloud, as well as recording some diagnostics using a local [Link] logging service. In this case, almost every action suffers the cloud-edge network latency, although services benefit from the additional cloud resources. We use 24 programmable Parrot AR2.0 drones (a subset is seen in Fig. 8c), together with a backend cluster of 20 two-socket,
Figure 9. Throughput-tail latency for the Swarm service when execution happens at the edge versus the cloud. (Panels: Edge-Image Recogn., Edge-Obstacle Avoid., Cloud-Image Recogn., Cloud-Obstacle Avoid.; x-axes: Queries per Second (QPS); plot data omitted.)
Figure 10. Cycle breakdown and IPC for the Social Network and E-commerce services. (Breakdown categories: Front-end, Bad Speculation, Back-end, Retiring; plot data omitted.)
40-core servers. Drones communicate with each other and the cluster over a wireless router.

3.7 Methodological Challenges of Microservices
A major challenge with microservices is that one cannot simply rely on the client to report performance, as with traditional client-server applications. Resolving performance issues requires determining which microservice(s) is the culprit of a QoS violation, which typically happens through distributed tracing. We developed and deployed a distributed tracing system that records per-microservice latencies at RPC granularity using the Thrift timing interface. RPCs or REST requests are timestamped upon arrival and departure from each microservice by the tracing module, and data is accumulated by the Trace Collector, implemented similarly to the Zipkin Collector [17], and stored in a centralized Cassandra database. We additionally track the time spent processing network requests, as opposed to application computation, using a similar methodology to [58]. We verify that the overhead from tracing is negligible, less than 0.1% on end-to-end latency in all cases, which is tolerable for such systems [26, 72, 76].

3.8 Provisioning & Query Diversity
Before characterizing the architectural behavior of microservices, we provision the end-to-end applications to ensure that microservices are used in a balanced way, and that no single microservice introduces early bottlenecks due to resource saturation. To do so, we start with a fair resource allocation for all microservices of an end-to-end workload, and upsize saturated microservices until all tiers saturate at about the same load. The ratio of resources between tiers varies significantly across end-to-end services, highlighting the need for application-aware resource management.

Different query types also achieve different performance in each service. For example, composePost requests in the Social Network vary in the media they embed in a message, ranging from text-only messages, to posts including image and video files (we keep videos within a few MBs, similar to the allowable video sizes in production social networks like Twitter). Reposting a post incurs the longest latency across query types for Social Network, as it must first read an existing post, prepend to it, and then propagate the message across the user's followers' timelines.

In E-commerce, on the other hand, placing an order, which includes adding an item to the cart, logging in to the account, confirming payment, and selecting shipping, takes 1-2 orders of magnitude longer than browsing the eshop's catalogue. In reality, placing an order requires interaction with the end user; in our case we automate the client's decisions so they incur zero delay, making latency server-dominated. The trends across query types are similar for the Media and Banking services, with processing payments, either to rent a movie, or to perform a transaction in a bank account, dominating latency and defining each service's saturation point.

Finally, in Fig. 9, we compare the performance of the IoT application when computation happens at the edge versus the cloud. Since drones have to communicate with a wireless router over a distance of several tens of meters, latencies are significantly higher than for the cloud-only services. When processing happens in the cloud, latency at low load is higher, penalized by the long network delay. As load increases however, edge devices quickly become oversubscribed due to the limited on-board resources, with processing on the cloud achieving 7.8x higher throughput for the same tail latency, or 20x lower latency for the same throughput. Obstacle avoidance shows a different trade-off, since it is less compute-intensive, and more latency-critical. Offloading obstacle avoidance to the cloud at low load can have catastrophic consequences if route adjustment is delayed, which highlights the importance of latency-aware resource management between cloud and edge, especially for safety-critical computation.

4 Architectural Implications
Methodology: We first evaluate the end-to-end services on a local cluster with 20 two-socket 40-core Intel Xeon servers (E2699-v4 and E5-2660 v3) with 128-256GB memory each, connected to a 10GBps ToR switch with 10Gbe NICs. All servers are running Ubuntu 16.04, and unless otherwise noted power management and turbo boosting are turned off.
Cycles breakdown and IPC: We use Intel vTune [10] to break down the cycles, and identify bottlenecks. Fig. 10
[Figures 11 and 12 appear here; plot data omitted. Figure 11 plots L1i MPKI per microservice; Figure 12 plots tail latency against load (QPS) at CPU frequencies from 1000 to 2400 MHz, with panels for the Social Network, Media Service, E-commerce, Banking System, and Swarm-Cloud services.]
shows the IPC and cycles for each microservice in the Social Network and E-commerce services. We omit the figures for the other services, however the observations are similar.

Across all services a large fraction of cycles, often the majority, is spent in the processor front-end. Front-end stalls occur for several reasons, including long memory accesses and i-cache misses. This is consistent with studies on traditional cloud applications [37, 50], although to a lesser extent for microservices than for monolithic services (memcached, mongodb), given their smaller code footprint. The majority of front-end stalls are due to fetch, while branch mispredictions account for a smaller fraction of stalls for microservices than for other interactive applications, either cloud or IoT [37, 88]. Only a small fraction of total cycles goes towards committing instructions (21% on average for Social Network), denoting that current systems are poorly provisioned for microservices-based applications.

E-commerce includes a few microservices that go against this trend, with high IPC and a high percentage of retired instructions, such as Search. Search (xapian [51]) is already optimized for memory locality, and has a relatively small codebase, which explains the fewer front-end stalls. The same applies for simple microservices, such as the wishlist, for which i-cache misses are practically negligible. E-commerce also includes a recommender engine, whose IPC is extremely low; this is again in agreement with studies on the architectural behavior of ML applications [44]. The challenge with microservices is that although individual application components may be well understood, the structure of the end-to-end dependency graph defines how individual services affect the overall performance. For both services, we also show the cycles breakdown and IPC for corresponding applications with the same end-to-end functionality from the user's perspective, but built as monoliths. In both cases, monoliths are developed in Java, and include all application functionality, except for the backend databases (in memcached and MongoDB), in a single binary. The cycles breakdown is not drastically different for monoliths compared to microservices, although they experience slightly higher percentages of committed instructions, due to reduced front-end stalls,

Figure 12. Tail latency with increasing load and decreasing frequency (RAPL) for traditional monolithic cloud applications, and the five end-to-end DeathStarBench services. Lighter colors (yellow) denote QoS violations.

I-cache pressure: Prior work has characterized the high pressure cloud applications put on the instruction caches [37, 52]. Since microservices decompose what would be one large binary to many small, loosely-connected services, we examine whether previous results on i-cache pressure still hold. Fig. 11 shows the MPKI of each microservice for the Social Network and E-commerce applications. We also include the back-end caching and database layers, as well as the corresponding L1i MPKI for the monolithic implementations.

First, the i-cache pressure of nginx, memcached, MongoDB, and especially the monoliths remains high, consistent with prior work [37, 52, 88]. The i-cache pressure of the remaining microservices though is considerably lower, especially for E-commerce, an expected observation given the microservices' small code footprints. Since [Link] applications outside the context of microservices do not have low i-cache miss rates [88], we conclude that it is the simplicity of microservices which results in better i-cache locality. Most L1i misses, especially in the Social Network, happen in the kernel, and are caused by Thrift. We also examined the LLC and D-TLB misses, and found them considerably lower than for traditional cloud applications, which is consistent with the push for microservices to be mostly stateless.

Brawny vs. wimpy cores: There has been a lot of work on whether small servers can replace high-end platforms in the cloud [25, 46–48]. Despite the power benefits of simple cores, interactive services still achieve better latency in servers that optimize for single-thread performance. Microservices offer an appealing target for simple cores, given the small amount of computation per microservice. We evaluate low-power machines in two ways. First, we use RAPL on our local cluster to reduce the frequency at which all microservices
as they are less likely to wait for network requests to com- run. Fig. 12 (top row) shows the change in tail latency as load
plete. IPC is also similar to microservices, and consistent with increases, and as the operating frequency decreases for five
previous studies on cloud services [37, 51]. popular, open-source single-tier interactive services: nginx,
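To make the metrics in this discussion concrete, L1i MPKI and the fraction of cycles (pipeline slots) spent committing instructions can be computed from raw hardware counter values, e.g., as collected with Linux perf or VTune [10]. The helper below is an illustrative sketch; the counter values are hypothetical, not measurements from the paper.

```python
def mpki(misses: int, instructions: int) -> float:
    """Misses per kilo-instruction: misses normalized to 1000 retired instructions."""
    return misses / (instructions / 1000.0)

def retiring_fraction(slots_retired: int, total_slots: int) -> float:
    """Fraction of pipeline slots that commit (retire) instructions, as in a
    top-down cycle breakdown; the remainder is stalled or wasted work."""
    return slots_retired / total_slots

# Hypothetical counts for one microservice over a profiling window:
insts = 4_000_000_000
l1i_misses = 6_000_000
print(mpki(l1i_misses, insts))        # 1.5 L1i MPKI
print(retiring_fraction(84, 400))     # 0.21, i.e., ~21% of slots retiring
```

A low MPKI with a low retiring fraction is the pattern described above: small code footprints keep the i-cache happy, while dependency and network waits still keep commit rates low.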
Session: Cloud I ASPLOS’19, April 13–17, 2019, Providence, RI, USA
...compute happens on the edge devices), the Social Network and E-commerce are most sensitive to low frequency, while the Swarm service is the least sensitive, primarily because it is bound by the cloud-edge communication latency, as opposed to compute speed. Apart from frequency scaling, there are platforms designed with low-power

[Figure 14. Time in kernel mode, user mode, and libraries for each service. Services shown: Social Net, Media Service, Ecommerce, Swarm Banking, Swarm Cloud, Swarm Edge; y-axis: Percentage (%).]

...cations like memcached and MongoDB spend most of their execution time in the kernel to handle interrupts, process TCP packets, and activate and schedule idling interactive services [57]. The large number of library cycles...
[Figure 15. Time in application vs network processing for (a) ...]

[Figure residue: timeline panels labeled "B. Memcached Backpressuring NGINX", showing NGINX and memcached exchanging read <k,v> requests (TCP proc, RPCs) over 60 s under high load.]
[Figure residue: FPGA bump-in-the-wire block diagram (DRAM, CPU, QPI, PCIe Gen3, NIC, Virtex7, QSFP transceivers, 10Gbps links) and per-service latency bars.]

[Caption fragment: "... hotspot that autoscalers can easily address, while Case B shows that a seemingly negligible bottleneck in memcached ..."]
Figure 16. (a) Overview of the FPGA configuration for RPC acceleration, and (b) the performance benefits of acceleration in terms of network and end-to-end tail latency.

...with the Social Network experiencing a 3.2× increase in end-to-end tail latency. The large impact of network processing occurs regardless of whether microservices communicate over RPCs (Social Network, Media Service, Banking), or over HTTP (E-commerce, Swarm-Edge), although RPCs introduce considerably lower latencies at low load than HTTP. Finally, Fig. 15a also shows the time the monolithic Social Network application spends processing network requests. Both at low, and especially at high load, the difference is dramatic, albeit justified, since monoliths are deployed as single binaries, with the majority of the network traffic corresponding to client-server communication.

Given the prominent role network processing has on tail latency, we now examine its potential for acceleration. We use a bump-in-the-wire setup, seen in Fig. 16a, similar to the one in [38], to offload the entire TCP stack [54, 69, 70, 74, 75] on a Virtex 7 FPGA using Vivado HLS. The FPGA is placed between the NIC and the top-of-rack switch (ToR), and is connected to both with matching transceivers, acting as a filter on the network. We maintain the PCIe connection between the host and the FPGA for accelerating other services, such as the machine learning models in the recommender engines, during periods of low network load. Fig. 16b shows the speedup from acceleration on network processing latency alone, and on the end-to-end latency of each of the services. Network processing latency improves by 10–68x over native TCP, while end-to-end tail latency improves by 43% and up to 2.2x. For interactive, latency-critical services, where even a small improvement in tail latency is significant, network acceleration provides a major boost in performance.

...dependencies between tiers can introduce backpressure effects, leading to system-wide hotspots [56, 59, 82, 85, 87]. Backpressure can additionally trick the cluster manager into penalizing or upsizing a saturated microservice, even though its saturation is the result of backpressure from another, potentially not-saturated service. Fig. 17 highlights this issue for a simplified two-tier application consisting of a webserver (nginx) and an in-memory caching key-value store (memcached). In case A, as the client issues read requests, nginx reaches saturation, causing its latency to increase rapidly and long queues to form in its input. This is a straightforward case, which autoscaling systems can easily tackle by scaling out nginx, as seen in the figure at t = 14s and t = 35s. Case B, on the other hand, highlights the challenges of backpressure. When using HTTP/1, requests within a single connection are blocking, i.e., there can only be one outstanding request per connection across tiers. Therefore, even though memcached itself is not saturated, it causes long queues of outstanding requests to form ahead of nginx, which in turn cause it to saturate. Current cluster managers cannot easily address this case, as a utilization-based autoscaling scheme would scale out nginx, which is busy waiting and appears saturated. As seen in the figure, not only does this not solve the problem, but it can potentially make it worse, by admitting even more traffic into the system. Even without the connection blocking in HTTP/1, ...

Figure 18. Microservices graphs for three real production cloud providers (Netflix, Twitter, Amazon) [6, 18, 19]. We also show these dependencies for Social Network.
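The HTTP/1 blocking effect described for Case B can be approximated with a simple back-of-the-envelope model: a synchronous worker is held for its own CPU time plus the time it blocks waiting on the downstream tier, so a small increase in memcached latency inflates nginx's apparent utilization without memcached itself saturating. The numbers below are illustrative assumptions, not measurements from the paper.

```python
def worker_utilization(arrival_rps: float, cpu_ms: float,
                       blocked_ms: float, workers: int) -> float:
    # A synchronous worker is occupied for its CPU time plus the time it
    # spends blocked on the downstream tier (one outstanding request per
    # HTTP/1 connection), even though memcached's own CPU time per
    # request is unchanged.
    held_s = (cpu_ms + blocked_ms) / 1000.0
    return arrival_rps * held_s / workers

# 8000 req/s, 1 ms of nginx CPU per request, 16 workers:
print(worker_utilization(8000, 1.0, 0.2, 16))  # 0.6: healthy
print(worker_utilization(8000, 1.0, 1.0, 16))  # 1.0: nginx appears saturated
```

This is why a utilization-based autoscaler misfires here: scaling out nginx adds capacity at the tier that is merely blocked, not at the tier causing the backpressure.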
Figure 19. Cascading QoS violations in Social Network compared to per-microservice CPU utilization.

Figure 20. (a) Microservices taking longer than monoliths to recover from a QoS violation, even (b) in the presence of autoscaling mechanisms.
minutes. The margins of the box plots show the 25th and 75th latency percentiles, while the whiskers show the 5th and 95th. In Lambda, we show performance and cost both for the default persistent storage (S3), and for a configuration that uses the memory of four additional EC2 instances to maintain intermediate state passed through dependent microservices.

[Figure residue: tail latency (ms) and cost for Amazon EC2, AWS Lambda (S3), and AWS Lambda (mem) across the Social Network, Media Service, Ecommerce, Banking System, and Swarm Cloud services, with annotated costs from $2.08 to $37.6.]

Latency is considerably higher for Lambda when using S3, primarily due to the overhead and rate limiting of the re...
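The box-plot statistics described above (25th/75th-percentile box margins, 5th/95th-percentile whiskers) can be reproduced from raw latency samples; the sketch below uses the standard library's `statistics.quantiles`, and the sample values are hypothetical.

```python
from statistics import quantiles

def box_stats(latencies_ms):
    """Return (p5, p25, p75, p95): the whisker and box-margin percentiles."""
    q = quantiles(latencies_ms, n=20)  # 19 cut points: 5th, 10th, ..., 95th
    return q[0], q[4], q[14], q[18]

# Hypothetical latency samples (ms) with a heavy tail:
samples = [1.2, 1.5, 1.7, 2.0, 2.2, 2.8, 3.1, 4.0, 9.5, 30.0]
p5, p25, p75, p95 = box_stats(samples)
print(p5, p25, p75, p95)
```

Reporting these percentiles rather than means is what exposes Lambda's tail behavior in the comparison above.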
[Figure residue: panels on request skew and back-end performance; legend includes "Micro (40)".]

Impact of slow servers: Fig. 22c shows the impact a small number of slow servers has on overall QoS as cluster size increases...
References
[1] [n. d.]. Apache Thrift. [Link]
[2] [n. d.]. ardrone-autonomy. [Link]/en/latest/.
[3] [n. d.]. AWS Lambda. [Link]
[4] [n. d.]. Conway's Law. [Link]
[5] [n. d.]. [Link]. [Link]
[6] [n. d.]. Decomposing Twitter: Adventures in Service-Oriented Architecture. [Link]/decomposing-twitter-adventures-in-serviceoriented-architecture.
[7] [n. d.]. Finagle: An extensible RPC system for the JVM. [Link]/finagle.
[8] [n. d.]. fission: Serverless Functions for Kubernetes. [Link]
[9] [n. d.]. gRPC: A high performance open-source universal RPC framework. [Link]
[10] [n. d.]. Intel VTune Amplifier. [Link]/intel-vtune-amplifier-xe.
[11] [n. d.]. jimp: An image processing library in Node.js with zero external dependencies. [Link]
[12] [n. d.]. mongoDB. [Link]
[13] [n. d.]. NGINX. [Link]
[14] [n. d.]. OpenLambda. [Link]
[15] [n. d.]. RabbitMQ. [Link]
[16] [n. d.]. SockShop: A Microservices Demo Application. [Link]/blog/sock-shop-microservices-demo-application.
[17] [n. d.]. Zipkin. [Link]
[18] 2016. The Evolution of Microservices. [Link]/adriancockcroft/evolution-of-microservices-craft-conference.
[19] Adrian Cockroft. [n. d.]. Microservices Workshop: Why, what, and how to get there. [Link]/microservices-workshop-craft-conference.
[20] autoscaleLimit [n. d.]. AWS Autoscaling. [Link]/autoscaling/.
[21] Luiz Barroso and Urs Hoelzle. 2009. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. MC Publishers.
[22] Robert Bell, Yehuda Koren, and Chris Volinsky. 2007. The BellKor 2008 Solution to the Netflix Prize. Technical Report.
[23] Leon Bottou. [n. d.]. Large-Scale Machine Learning with Stochastic Gradient Descent. In Proceedings of the International Conference on Computational Statistics (COMPSTAT). Paris, France, 2010.
[24] Adrian M. Caulfield, Eric S. Chung, Andrew Putnam, Hari Angepat, Jeremy Fowers, Michael Haselman, Stephen Heil, Matt Humphrey, Puneet Kaur, Joo-Young Kim, Daniel Lo, Todd Massengill, Kalin Ovtcharov, Michael Papamichael, Lisa Woods, Sitaram Lanka, Derek Chiou, and Doug Burger. 2016. A Cloud-scale Acceleration Architecture. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-49). IEEE Press, Piscataway, NJ, USA, Article 7, 13 pages. [Link]
[25] Shuang Chen, Shay Galon, Christina Delimitrou, Srilatha Manne, and Jose F. Martinez. 2017. Workload Characterization of Interactive Cloud Services on Big and Small Server Platforms. In Proc. of IISWC.
[26] Michael Chow, David Meisner, Jason Flinn, Daniel Peek, and Thomas F. Wenisch. 2014. The Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI'14). USENIX Association, Berkeley, CA, USA, 217–231. [Link]
[27] Eric S. Chung, Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Adrian M. Caulfield, Todd Massengill, Ming Liu, Daniel Lo, Shlomi Alkalay, Michael Haselman, Maleen Abeydeera, Logan Adams, Hari Angepat, Christian Boehn, Derek Chiou, Oren Firestein, Alessandro Forin, Kang Su Gatlin, Mahdi Ghandi, Stephen Heil, Kyle Holohan, Ahmad El Husseini, Tamás Juhász, Kara Kagi, Ratna Kovvuri, Sitaram Lanka, Friedel van Megen, Dima Mukhortov, Prerak Patel, Brandon Perez, Amanda Rapsang, Steven K. Reinhardt, Bita Rouhani, Adam Sapek, Raja Seera, Sangeetha Shekar, Balaji Sridharan, Gabriel Weisz, Lisa Woods, Phillip Yi Xiao, Dan Zhang, Ritchie Zhao, and Doug Burger. 2018. Serving DNNs in Real Time at Datacenter Scale with Project Brainwave. IEEE Micro 38, 2 (2018), 8–20. [Link]
[28] Jeffrey Dean and Luiz Andre Barroso. [n. d.]. The Tail at Scale. In CACM, Vol. 56 No. 2, Pages 74-80.
[29] Christina Delimitrou and Christos Kozyrakis. [n. d.]. Paragon: QoS-Aware Scheduling for Heterogeneous Datacenters. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). Houston, TX, USA, 2013.
[30] Christina Delimitrou and Christos Kozyrakis. [n. d.]. QoS-Aware Scheduling in Heterogeneous Datacenters with Paragon. In ACM Transactions on Computer Systems (TOCS), Vol. 31 Issue 4. December 2013.
[31] Christina Delimitrou and Christos Kozyrakis. [n. d.]. Quality-of-Service-Aware Scheduling in Heterogeneous Datacenters with Paragon. In IEEE Micro Special Issue on Top Picks from the Computer Architecture Conferences. May/June 2014.
[32] Christina Delimitrou and Christos Kozyrakis. [n. d.]. Quasar: Resource-Efficient and QoS-Aware Cluster Management. In Proceedings of the Nineteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). Salt Lake City, UT, USA, 2014.
[33] Christina Delimitrou and Christos Kozyrakis. 2016. HCloud: Resource-Efficient Provisioning in Shared Cloud Systems. In Proceedings of the Twenty First International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[34] Christina Delimitrou and Christos Kozyrakis. 2017. Bolt: I Know What You Did Last Summer... In The Cloud. In Proceedings of the Twenty Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[35] Christina Delimitrou and Christos Kozyrakis. 2018. Amdahl's Law for Tail Latency. In Communications of the ACM (CACM).
[36] Christina Delimitrou, Daniel Sanchez, and Christos Kozyrakis. 2015. Tarcil: Reconciling Scheduling Speed and Quality in Large Shared Clusters. In Proceedings of the Sixth ACM Symposium on Cloud Computing (SOCC).
[37] Michael Ferdman, Almutaz Adileh, Onur Kocberber, Stavros Volos, Mohammad Alisafaee, Djordje Jevdjic, Cansu Kaynak, Adrian Daniel Popescu, Anastasia Ailamaki, and Babak Falsafi. [n. d.]. Clearing the Clouds: A Study of Emerging Scale-out Workloads on Modern Hardware. In Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). London, England, UK, 2012, 12. [Link]
[38] Daniel Firestone, Andrew Putnam, Sambhrama Mundkur, Derek Chiou, Alireza Dabagh, Mike Andrewartha, Hari Angepat, Vivek Bhanu, Adrian Caulfield, Eric Chung, Harish Kumar Chandrappa, Somesh Chaturmohta, Matt Humphrey, Jack Lavier, Norman Lam, Fengfen Liu, Kalin Ovtcharov, Jitu Padhye, Gautham Popuri, Shachar Raindel, Tejas Sapre, Mark Shaw, Gabriel Silva, Madhan Sivakumar, Nisheeth Srivastava, Anshuman Verma, Qasim Zuhair, Deepak Bansal, Doug Burger, Kushagra Vaid, David A. Maltz, and Albert Greenberg. 2018. Azure Accelerated Networking: SmartNICs in the Public Cloud. In 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18). USENIX Association, Renton, WA, 51–66. [Link]
[39] Brad Fitzpatrick. [n. d.]. Distributed caching with memcached. In Linux Journal, Volume 2004, Issue 124, 2004.
[40] Jason Flinn. September 2012. Cyber Foraging: Bridging Mobile and Cloud Computing. Synthesis Lectures on Mobile and Pervasive Computing.
[41] Yu Gan and Christina Delimitrou. 2018. The Architectural Implications of Cloud Microservices. In Computer Architecture Letters (CAL), vol. 17, iss. 2.
[42] Vishal Gupta and Karsten Schwan. [n. d.]. Brawny vs. Wimpy: Evaluation and Analysis of Modern Workloads on Heterogeneous Processors. In Proceedings of IEEE International Symposium on Parallel & Distributed Processing (IPDPS). Boston, MA, 2013.
[43] Ragib Hasan, Md. Mahmud Hossain, and Rasib Khan. 2015. Aura: An IoT Based Cloud Infrastructure for Localized Mobile Computation Outsourcing. In 3rd IEEE International Conference on Mobile Cloud Computing, Services, and Engineering, MobileCloud. San Francisco, CA, 183–188. [Link]
[44] Johann Hauswald, Michael A. Laurenzano, Yunqi Zhang, Cheng Li, Austin Rovinski, Arjun Khurana, Ronald G. Dreslinski, Trevor Mudge, Vinicius Petrucci, Lingjia Tang, and Jason Mars. 2015. Sirius: An Open End-to-End Voice and Vision Personal Assistant and Its Implications for Future Warehouse Scale Computers. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '15). ACM, New York, NY, USA, 223–238. [Link]
[45] Ben Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, and Ion Stoica. [n. d.]. Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. In Proceedings of NSDI. Boston, MA, 2011.
[46] Urs Hölzle. [n. d.]. Brawny cores still beat wimpy cores, most of the time. In IEEE Micro. 2010.
[47] Vijay Janapa Reddi, Benjamin C. Lee, Trishul Chilimbi, and Kushagra Vaid. [n. d.]. Mobile Processors for Energy-Efficient Web Search. In ACM Transactions on Computer Systems, Vol. 29, No. 4, Article 9. 2011.
[48] Vijay Janapa Reddi, Benjamin C. Lee, Trishul Chilimbi, and Kushagra Vaid. 2010. Web Search Using Mobile Cores: Quantifying and Mitigating the Price of Efficiency. In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA '10). ACM, New York, NY, USA, 314–325. [Link]
[49] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. 2017. In-Datacenter Performance Analysis of a Tensor Processing Unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). ACM, New York, NY, USA, 1–12.
[50] Svilen Kanev, Juan Darago, Kim Hazelwood, Parthasarathy Ranganathan, Tipp Moseley, Gu-Yeon Wei, and David Brooks. 2014. Profiling a warehouse-scale computer. In ISCA '15 Proceedings of the 42nd Annual International Symposium on Computer Architecture. 158–169.
[51] Harshad Kasture and Daniel Sanchez. 2016. TailBench: A Benchmark Suite and Evaluation Methodology for Latency-Critical Applications. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC).
[52] Cansu Kaynak, Boris Grot, and Babak Falsafi. 2013. SHIFT: shared history instruction fetch for lean-core server processors. In The 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46). Davis, CA, 272–283. [Link]
[53] Krzysztof C. Kiwiel. [n. d.]. Convergence and efficiency of subgradient methods for quasiconvex minimization. In Mathematical Programming (Series A) (Berlin, Heidelberg: Springer) 90 (1): pp. 1-25, 2001.
[54] David Koeplinger, Raghu Prabhakar, Yaqi Zhang, Christina Delimitrou, Christos Kozyrakis, and Kunle Olukotun. 2016. Automatic Generation of Efficient Accelerators for Reconfigurable Hardware. In 43rd ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2016, Seoul, South Korea, June 18-22, 2016. 115–127. [Link]
[55] Nane Kratzke and Peter-Christian Quint. 2016. Ppbench. In Proceedings of the 6th International Conference on Cloud Computing and Services Science - Volume 1 and 2 (CLOSER 2016). SCITEPRESS - Science and Technology Publications, Lda, Portugal, 223–231.
[56] Chien-An Lai, Josh Kimball, Tao Zhu, Qingyang Wang, and Calton Pu. 2017. milliScope: A Fine-Grained Monitoring Framework for Performance Debugging of n-Tier Web Services. In 37th IEEE International Conference on Distributed Computing Systems, ICDCS 2017, Atlanta, GA, USA, June 5-8, 2017. 92–102.
[57] Jacob Leverich and Christos Kozyrakis. [n. d.]. Reconciling High Server Utilization and Sub-millisecond Quality-of-Service. In Proceedings of EuroSys. Amsterdam, The Netherlands, 2014.
[58] Jialin Li, Naveen Kr. Sharma, Dan R. K. Ports, and Steven D. Gribble. 2014. Tales of the Tail: Hardware, OS, and Application-level Sources of Tail Latency. In Proceedings of the ACM Symposium on Cloud Computing (SOCC '14). ACM, New York, NY, USA, Article 9, 14 pages.
[59] Jack Li, Qingyang Wang, Chien-An Lai, Junhee Park, Daisaku Yokoyama, and Calton Pu. 2014. The Impact of Software Resource Allocation on Consolidated n-Tier Applications. In 2014 IEEE 7th International Conference on Cloud Computing, Anchorage, AK, USA, June 27 - July 2, 2014. 320–327.
[60] Ching-Chi Lin, Pangfeng Liu, and Jan-Jan Wu. [n. d.]. Energy-Aware Virtual Machine Dynamic Provision and Scheduling for Cloud Computing. In Proceedings of the 2011 IEEE 4th International Conference on Cloud Computing (CLOUD). Washington, DC, USA, 2011. [Link]
[61] David Lo, Liqun Cheng, Rama Govindaraju, Luiz André Barroso, and Christos Kozyrakis. [n. d.]. Towards Energy Proportionality for Large-scale Latency-critical Workloads. In Proceedings of the 41st Annual International Symposium on Computer Architecture (ISCA). Minneapolis, MN, 2014.
[62] David Lo, Liqun Cheng, Rama Govindaraju, Parthasarathy Ranganathan, and Christos Kozyrakis. [n. d.]. Heracles: Improving Resource Efficiency at Scale. In Proc. of the 42nd Annual International Symposium on Computer Architecture (ISCA). Portland, OR, 2015.
[63] Anil Madhavapeddy, Richard Mortier, Charalampos Rotsos, David Scott, Balraj Singh, Thomas Gazagnaire, Steven Smith, Steven Hand, and Jon Crowcroft. 2013. Unikernels: Library Operating Systems for the Cloud. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '13). ACM, New York, NY, USA, 461–472. [Link]
[64] Jason Mars and Lingjia Tang. [n. d.]. Whare-map: heterogeneity in "homogeneous" warehouse-scale computers. In Proceedings of ISCA. Tel-Aviv, Israel, 2013.
[65] David Meisner, Christopher M. Sadler, Luiz André Barroso, Wolf-Dietrich Weber, and Thomas F. Wenisch. 2011. Power management of online data-intensive services. In Proceedings of the 38th annual international symposium on Computer architecture. 319–330.
[66] Ripal Nathuji, Canturk Isci, and Eugene Gorbatov. [n. d.]. Exploiting platform heterogeneity for power efficient data centers. In Proceedings of ICAC. Jacksonville, FL, 2007.
[67] Ripal Nathuji, Aman Kansal, and Alireza Ghaffarkhah. [n. d.]. Q-Clouds: Managing Performance Interference Effects for QoS-Aware Clouds. In Proceedings of EuroSys. Paris, France, 2010.
[68] Kay Ousterhout, Patrick Wendell, Matei Zaharia, and Ion Stoica. [n. d.]. Sparrow: Distributed, Low Latency Scheduling. In Proceedings of SOSP. Farmington, PA, 2013.
[69] Raghu Prabhakar, David Koeplinger, Kevin J. Brown, HyoukJoong Lee, Christopher De Sa, Christos Kozyrakis, and Kunle Olukotun. 2016. Generating Configurable Hardware from Parallel Patterns. In Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '16, Atlanta, GA, USA, April 2-6, 2016. 651–665.
[70] Raghu Prabhakar, Yaqi Zhang, David Koeplinger, Matthew Feldman, Tian Zhao, Stefan Hadjis, Ardavan Pedram, Christos Kozyrakis, and Kunle Olukotun. 2017. Plasticine: A Reconfigurable Architecture For Parallel Patterns. In Proceedings of the 44th Annual International Symposium on Computer Architecture, ISCA 2017, Toronto, ON, Canada, June 24-28, 2017. 389–402. [Link]
[71] Andrew Putnam, Adrian M. Caulfield, Eric S. Chung, Derek Chiou, Kypros Constantinides, John Demme, Hadi Esmaeilzadeh, Jeremy Fowers, Gopi Prashanth Gopal, Jan Gray, Michael Haselman, Scott Hauck, Stephen Heil, Amir Hormati, Joo-Young Kim, Sitaram Lanka, James Larus, Eric Peterson, Simon Pope, Aaron Smith, Jason Thong, Phillip Yi Xiao, and Doug Burger. 2014. A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services. In Proc. of the 41st Intl. Symp. on Computer Architecture.
[72] Gang Ren, Eric Tune, Tipp Moseley, Yixin Shi, Silvius Rus, and Robert Hundt. 2010. Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers. IEEE Micro (2010), 65–79. [Link]
[73] Malte Schwarzkopf, Andy Konwinski, Michael Abd-El-Malek, and John Wilkes. [n. d.]. Omega: flexible, scalable schedulers for large compute clusters. In Proceedings of EuroSys. Prague, Czech Republic, 2013.
[74] D. Sidler, G. Alonso, M. Blott, K. Karras, Kees Vissers, and Raymond Carley. [n. d.]. Scalable 10Gbps TCP/IP Stack Architecture for Reconfigurable Hardware. In Proceedings of FCCM. 2015.
[75] D. Sidler, Z. Istvan, and G. Alonso. [n. d.]. Low-Latency TCP/IP Stack for Data Center Applications. In Proceedings of FPL. 2016.
[76] Benjamin H. Sigelman, Luiz André Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag. 2010. Dapper, a Large-Scale Distributed Systems Tracing Infrastructure. Technical Report. Google, Inc. [Link]/archive/papers/[Link]
[77] David Sprott and Lawrence Wilkes. January 2004. Understanding Service-Oriented Architecture, CBDI Forum.
[78] Akshitha Sriraman and Thomas F. Wenisch. 2018. uSuite: A Benchmark Suite for Microservices. In 2018 IEEE International Symposium on Workload Characterization, IISWC 2018, Raleigh, NC, USA, September 30 - October 2, 2018. 1–12.
[79] Takanori Ueda, Takuya Nakaike, and Moriyoshi Ohara. [n. d.]. Workload characterization for microservices. In Proc. of IISWC. 2016.
[80] Abhishek Verma, Luis Pedrosa, Madhukar R. Korupolu, David Oppenheimer, Eric Tune, and John Wilkes. 2015. Large-scale cluster management at Google with Borg. In Proceedings of the European Conference on Computer Systems (EuroSys). Bordeaux, France.
[81] Lei Wang, Jianfeng Zhan, Chunjie Luo, Yuqing Zhu, Qiang Yang, Yongqiang He, Wanling Gao, Zhen Jia, Yingjie Shi, Shujie Zhang, Chen Zheng, Gang Lu, Kent Zhan, Xiaona Li, and Bizhu Qiu. 2014. BigDataBench: A big data benchmark suite from internet services. 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA) (2014), 488–499. [Link]/10.1109/HPCA.2014.6835958
[82] Qingyang Wang, Chien-An Lai, Yasuhiko Kanemasa, Shungeng Zhang, and Calton Pu. 2017. A Study of Long-Tail Latency in n-Tier Systems: RPC vs. Asynchronous Invocations. In 37th IEEE International Conference on Distributed Computing Systems, ICDCS 2017, Atlanta, GA, USA, June 5-8, 2017. 207–217.
[83] Ian H. Witten, Eibe Frank, and Geoffrey Holmes. [n. d.]. Data Mining: Practical Machine Learning Tools and Techniques. 3rd Edition.
[84] Hailong Yang, Alex Breslow, Jason Mars, and Lingjia Tang. [n. d.]. Bubble-flux: precise online QoS management for increased utilization in warehouse scale computers. In Proceedings of ISCA. 2013.
[85] Hailong Yang, Quan Chen, Moeiz Riaz, Zhongzhi Luan, Lingjia Tang, and Jason Mars. 2017. PowerChief: Intelligent Power Allocation for Multi-Stage Applications to Improve Responsiveness on Power Constrained CMP. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). ACM, New York, NY, USA, 133–146.
[86] Xiang Zhou, Xin Peng, Tao Xie, Jun Sun, Chenjie Xu, Chao Ji, and Wenyun Zhao. 2018. Benchmarking Microservice Systems for Software Engineering Research. In Proceedings of the 40th International Conference on Software Engineering: Companion Proceedings (ICSE '18). ACM, New York, NY, USA, 323–324.
[87] Tao Zhu, Jack Li, Josh Kimball, Junhee Park, Chien-An Lai, Calton Pu, and Qingyang Wang. 2017. Limitations of Load Balancing Mechanisms for N-Tier Systems in the Presence of Millibottlenecks. In 37th IEEE International Conference on Distributed Computing Systems, ICDCS 2017, Atlanta, GA, USA, June 5-8, 2017. 1367–1377.
[88] Yuhao Zhu, Daniel Richins, Matthew Halpern, and Vijay Janapa Reddi. 2015. Microarchitectural Implications of Event-driven Server-side Web Applications. In Proceedings of the 48th International Symposium on Microarchitecture (MICRO-48). ACM, New York, NY, USA, 762–774. [Link]
Microservices-based applications, like monolithic services, are bottlenecked primarily by front-end stalls, though to a lesser extent. They require better provisioning for latency-critical constraints, since each microservice must maintain predictability under varying loads. Additionally, the dependencies and interactions between microservices can propagate QoS violations, which then take longer to recover from due to autoscaling inefficiencies, in contrast to the sharper recovery times of monolithic services.
Detailed profiling of warehouse-scale computers reveals that tail latencies, front-end bottlenecks, and the architectural need for efficient single-thread performance are critical for handling future cloud applications. The findings highlight that future designs must address predictability and efficiency when handling microservices, which demand robust single-core performance under varying load conditions.
Microservices have better i-cache locality than traditional monolithic applications due to their simplicity, resulting in lower i-cache miss rates. They still spend a substantial fraction of cycles on front-end stalls, mainly due to fetch, and suffer bottlenecks similar to those of traditional applications, albeit to a lesser degree. Microservices are, however, more sensitive to poor single-thread performance, since they must adhere to stricter tail latency constraints than monolithic applications.
Emerging scale-out workloads, including microservices, are significantly affected by architectural features such as cache misses and inter-core communication. While microservices typically enjoy better i-cache locality due to their simplicity, any increase in inter-core communication or cache misses disproportionately hurts performance, given their sensitivity to predictability and latency.
Autoscaling mechanisms take longer to address QoS violations in microservices than in monolithic applications, because they cannot rapidly identify and allocate resources to the true source of the performance issue. This delay results in longer queues and slower recovery from violations, whereas monolithic applications can quickly rebalance load by instantiating new copies.
Cloud-native microservices-based applications, such as the Social Network and E-commerce services, distribute computation unevenly across tiers. At low load, front-end components like nginx dominate latency, but at high load the bottlenecks shift to back-end elements like databases. These shifts necessitate robust load management to prevent QoS violations, which microservices find challenging due to their dependency graphs.
Microservices exhibit different performance imbalances across tiers depending on load. At low load, latency is generally dominated by the front-end, such as nginx, while at high load, back-end databases and services like writeGraph become the primary performance limiters. In E-commerce, additional impact comes from compute-intensive microservices written in high-level languages, with services such as orders, catalogue, and payment being major latency contributors.
'Brawny' cores, designed for single-thread performance, generally provide better latency for interactive services despite the power advantages of 'wimpy' cores. Microservices, with their small computational footprints and strict tail latency constraints, benefit disproportionately from 'brawny' cores, which improve performance predictability.
A system's sensitivity to frequency scaling depends on whether it is compute-bound or communication-bound. Services like Swarm, bound by cloud-edge communication latency rather than compute speed, are the least sensitive to frequency scaling, whereas compute-bound services like the Social Network and E-commerce are the most sensitive.
Latency-aware resource management is crucial for safety-critical computations, because offloading to the cloud can introduce delays with catastrophic consequences in latency-critical tasks such as obstacle avoidance, where timely route adjustments are essential.
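The latency-budget argument for obstacle avoidance can be made concrete: a vehicle moving at speed v that detects an obstacle at distance d must receive an updated route within d/v seconds, so any offload path whose round-trip plus compute time exceeds that budget is unsafe. The figures below are illustrative assumptions, not measurements from the paper.

```python
def offload_is_safe(distance_m: float, speed_mps: float,
                    rtt_s: float, compute_s: float) -> bool:
    # The route correction must arrive before the vehicle covers the
    # remaining distance to the obstacle.
    budget_s = distance_m / speed_mps
    return rtt_s + compute_s < budget_s

# A drone at 10 m/s spotting an obstacle 5 m ahead has a 0.5 s budget:
print(offload_is_safe(5.0, 10.0, rtt_s=0.020, compute_s=0.050))  # True: edge offload fits
print(offload_is_safe(5.0, 10.0, rtt_s=0.600, compute_s=0.050))  # False: slow cloud path
```

This is why the Swarm service keeps obstacle avoidance on the edge devices: the cloud round-trip can consume the entire reaction budget regardless of how fast the cloud computes.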