Session: Cloud I ASPLOS’19, April 13–17, 2019, Providence, RI, USA

An Open-Source Benchmark Suite for Microservices and
Their Hardware-Software Implications for Cloud & Edge Systems

Yu Gan, Yanqi Zhang, Dailun Cheng, Ankitha Shetty, Priyal Rathi, Nayan Katarki, Ariana Bruno, Justin Hu, Brian Ritchken, Brendon Jackson, Kelvin Hu, Meghna Pancholi, Yuan He, Brett Clancy, Chris Colen, Fukang Wen, Catherine Leung, Siyuan Wang, Leon Zaruvinsky, Mateo Espinosa, Rick Lin, Zhongling Liu, Jake Padilla, Christina Delimitrou

Cornell University
{yg397, yz2297, dc924, aas394, pr348, nk646, amb633, jh2625, bjr96, btj28, sh2442, mp832, yh772, bjc265, cdc99, fw224, chl66, sw884, laz37, me326, cl2545, zl682, jsp264, delimitrou}@[Link]
Abstract
Cloud services have recently started undergoing a major shift from monolithic applications, to graphs of hundreds of loosely-coupled microservices. Microservices fundamentally change a lot of assumptions current cloud systems are designed with, and present both opportunities and challenges when optimizing for quality of service (QoS) and utilization. In this paper we explore the implications microservices have across the cloud system stack. We first present DeathStarBench, a novel, open-source benchmark suite built with microservices that is representative of large end-to-end services, modular and extensible. DeathStarBench includes a social network, a media service, an e-commerce site, a banking system, and IoT applications for coordination control of UAV swarms. We then use DeathStarBench to study the architectural characteristics of microservices, their implications in networking and operating systems, their challenges with respect to cluster management, and their trade-offs in terms of application design and programming frameworks. Finally, we explore the tail at scale effects of microservices in real deployments with hundreds of users, and highlight the increased pressure they put on performance predictability.

CCS Concepts • Computer systems organization → Cloud computing; • Software and its engineering → n-tier architectures; Cloud computing.

Keywords cloud computing, datacenters, microservices, cluster management, serverless, acceleration, fpga, QoS

ACM Reference Format:
Yu Gan, Y. Zhang, D. Cheng, A. Shetty, P. Rathi, N. Katarki, A. Bruno, J. Hu, B. Ritchken, B. Jackson, K. Hu, M. Pancholi, Y. He, B. Clancy, C. Colen, F. Wen, C. Leung, S. Wang, L. Zaruvinsky, M. Espinosa, R. Lin, Z. Liu, J. Padilla, and C. Delimitrou. 2019. An Open-Source Benchmark Suite for Microservices and Their Hardware-Software Implications for Cloud & Edge Systems. In Proceedings of 2019 Architectural Support for Programming Languages and Operating Systems (ASPLOS'19). ACM, New York, NY, USA, 16 pages. https://[Link]/10.1145/3297858.3304013

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@[Link].
ASPLOS'19, April 13–17, 2019, Providence, RI, USA
© 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-6240-5/19/04...$15.00
[Link]


1 Introduction

Large-scale datacenters host an increasing number of popular online cloud services that span all aspects of human endeavor. Many of these applications are interactive, latency-critical services that must meet strict performance (throughput and tail latency) and availability constraints, while also handling frequent software updates [21, 28–34, 36, 44, 51, 61, 62, 65]. The effort to satisfy these often contradicting requirements has pushed datacenter applications on the verge of a major design shift, from complex monolithic services that encompass the entire application functionality in a single binary, to graphs with tens or hundreds of single-purpose, loosely-coupled microservices. This shift is becoming increasingly pervasive with large cloud providers, such as Amazon, Twitter, Netflix, Apple, and EBay having already adopted the microservices application model [6, 18, 19], and Netflix reporting more than 200 unique microservices in their ecosystem, as of the end of 2016 [18, 19].

The increasing popularity of microservices is justified by several reasons. First, they promote composable software design, simplifying and accelerating development, with each microservice being responsible for a small subset of the application's functionality. The richer the functionality of cloud services becomes, the more the modular design of microservices helps manage system complexity. They similarly facilitate deploying, scaling, and updating individual microservices independently, avoiding long development cycles, and improving elasticity. Fig. 1 shows the deployment differences between a traditional monolithic service, and an application built with microservices. While the entire monolith is scaled out on multiple servers, microservices allow individual components of the end-to-end application to be elastically scaled, with microservices of complementary resources bin-packed on the same physical server. Even though modularity in cloud services was already part of the Service-Oriented Architecture (SOA) design approach [77], the fine granularity of microservices, and their independent deployment create hardware and software challenges different from those in traditional SOA workloads.

Figure 1. Differences in the deployment of monoliths and microservices.

Second, microservices enable programming language and framework heterogeneity, with each tier developed in the most suitable language, only requiring a common API for microservices to communicate with each other; typically over remote procedure calls (RPC) [1, 7, 9] or a RESTful API. In contrast, monoliths limit the languages used for development, and make frequent updates cumbersome and error-prone.

Finally, microservices simplify correctness and performance debugging, as bugs can be isolated in specific tiers, unlike monoliths, where resolving bugs often involves troubleshooting the entire service. This makes them additionally applicable to internet-of-things (IoT) applications, that often host mission-critical computation, which puts more pressure on correctness verification [40, 43].

Despite their advantages, microservices represent a significant departure from the way cloud services are traditionally designed, and have broad implications ranging from cloud management and programming frameworks, to operating systems and datacenter hardware design.

In this paper we explore the implications microservices have across the cloud system stack, from hardware all the way to application design, using a suite of new end-to-end and representative applications built with tens of microservices each. The DeathStarBench suite¹ includes six end-to-end services that cover a wide spectrum of popular cloud and edge services: a social network, a media service (movie reviewing, renting, streaming), an e-commerce site, a secure banking system, and Swarm; an IoT service for coordination and control of drone swarms, with and without a cloud backend. Each service includes tens of microservices in different languages and programming models, including [Link], Python, C/C++, Java, Javascript, Scala, and Go, and leverages open-source applications, such as NGINX [13], memcached [39], MongoDB [12], Cylon [5], and Xapian [51]. To create the end-to-end services, we built custom RPC and RESTful APIs using popular open-source frameworks like Apache Thrift [1], and gRPC [9]. Finally, to track how user requests progress through microservices, we have developed a lightweight, transparent-to-the-user distributed tracing system, similar to Dapper [76] and Zipkin [17], that tracks requests at RPC granularity, associates RPCs belonging to the same end-to-end request, and records traces in a centralized database. We study both traffic generated by real users of the services, and synthetic loads generated by open-loop workload generators.

Figure 2. Exploring the implications of microservices across the system stack. (Layers, bottom to top: 1. Hardware; 2. OS/Network; 3. Cluster Management; 4. Application/Programming Framework; 5. Tail at Scale.)

We use these services to study the implications of microservices spanning the system stack, as seen in Fig. 2. First, we quantify how effective current datacenter architectures are at

¹Named after the DeathStar graphs that visualize dependencies between microservices [18, 19].


running microservices, as well as how datacenter hardware needs to change to better accommodate their performance and resource requirements (Section 4). This includes analyzing the cycle breakdown in modern servers, examining whether big or small cores are preferable [25, 35, 41, 42, 46–48], determining the pressure microservices put on instruction caches [37, 52], and exploring the potential they have for hardware acceleration [24, 27, 38, 49, 71]. We show that despite the small amount of computation per microservice, the latency requirements of each individual tier are much stricter than for typical applications, putting more pressure on predictably high single-thread performance.

Second, we quantify the networking and operating system implications of microservices. Specifically we show that, similarly to traditional cloud applications, microservices spend a large fraction of time in the kernel. Unlike monolithic services though, microservices spend much more time sending and processing network requests over RPCs or other REST APIs. Fig. 3 shows the breakdown of execution time to network (red) and application processing (green) for three monolithic services (NGINX, memcached, MongoDB) and the end-to-end Social Network application. While for the single-tier services only a small amount of time goes towards network processing, when using microservices, this time increases to 36.3% of total execution time, causing the system's resource bottlenecks to change drastically. In Section 5 we show that offloading RPC processing to an FPGA tightly-coupled with the host server, can improve network performance by 10-60×.

Figure 3. Network (red) vs. application processing (green) for monoliths and microservices. (NGINX: 5.3%/94.7%, Lat=1293usec; memcached: 19.8%/80.2%, Lat=186usec; MongoDB: 13.6%/86.4%, Lat=383usec; Social Network: 36.3%/63.7%, Lat=3827usec.)

Third, microservices significantly complicate cluster management. Even though the cluster manager can scale out individual microservices on-demand instead of the entire monolith, dependencies between microservices introduce backpressure effects and cascading QoS violations that quickly propagate through the system, making performance unpredictable. Existing cluster managers that optimize for performance and/or utilization [29, 32, 33, 36, 45, 60–62, 64, 66–68, 73, 80, 84] are not expressive enough to account for the impact each pair-wise dependency has on end-to-end performance. In Section 6, we show that mismanaging even a single such dependency dramatically hurts tail latency, e.g., by 10.4× for the Social Network, and requires long periods for the system to recover, compared to the corresponding monolithic service. We also show that traditional autoscaling mechanisms, present in many cloud infrastructures, fall short of addressing QoS violations caused by dependencies between microservices.

Fourth, in Section 7, we identify microservices creating bottlenecks in the end-to-end service's critical path, quantify the performance trade-offs between RPC and RESTful APIs, and explore the performance and cost implications of running microservices on serverless programming frameworks.

Finally, given that performance issues in the cloud often only emerge at large scale [28], in Section 8 we use real application deployments with hundreds of users to show that tail-at-scale effects become more pronounced in microservices compared to monolithic applications, as a single poorly-configured microservice, or slow server can degrade end-to-end latency by several orders of magnitude.

As microservices continue to evolve, it is essential for datacenter hardware, operating and networking systems, cluster managers, and programming frameworks to also evolve with them, to ensure that their prevalence does not come at a performance and/or efficiency loss. DeathStarBench is currently used in several academic and industrial institutions with applications in serverless compute, hardware acceleration, and runtime management. We hope that open-sourcing it to a wider audience will encourage more research in this emerging field.

2 Related Work

Cloud applications have attracted a lot of attention over the past decade, with several benchmark suites being released both from academia and industry [37, 44, 51, 81, 88]. Cloudsuite for example, includes both batch and interactive services, such as memcached, and has been used to study the architectural implications of cloud benchmarks [37]. Similarly, TailBench aggregates a set of interactive benchmarks, from web servers and databases to speech recognition and machine translation systems, and proposes a new methodology to analyze their performance [51]. Sirius also focuses on intelligent personal assistant workloads, such as voice to text translation, and has been used to study the acceleration potential for interactive ML applications [44].

A limitation of these benchmark suites is that they focus on single-tier applications, or at most services with two or three tiers, which drastically deviates from the way cloud services are deployed today. For example, even applications like websearch, which is a classic multi-tier workload, are configured as independent leaf nodes, which does not capture correlations across tiers. As we show in Sections 4-7, studying the effects of microservices using existing benchmarks leads to fundamentally different conclusions altogether.

The emergence of microservices has prompted recent work to study their characteristics and requirements [55, 78, 79, 86]. µSuite for example quantifies the system call, context

Table 1. Characteristics and code composition of each end-to-end microservices-based application.

Service            | Total New LoCs | Comm. Protocol: LoCs (Handwritten / Autogen) | Unique Microservices | Per-language LoC breakdown (end-to-end service)
Social Network     | 15,198         | RPC: 9,286 / 52,863                          | 36 | 34% C, 23% C++, 18% Java, 7% [Link], 6% Python, 5% Scala, 3% PHP, 2% Javascript, 2% Go
Movie Reviewing    | 12,155         | RPC: 9,853 / 48,001                          | 38 | 30% C, 21% C++, 20% Java, 10% PHP, 8% Scala, 5% [Link], 3% Python, 3% Javascript
E-commerce Website | 16,194         | REST: 4,798 / -; RPC: 2,658 / 12,085         | 41 | 21% Java, 16% C++, 15% C, 14% Go, 10% Javascript, 7% [Link], 5% Scala, 4% HTML, 3% Ruby
Banking System     | 13,876         | RPC: 4,757 / 31,156                          | 34 | 29% C, 25% Javascript, 16% Java, 16% [Link], 11% C++, 3% Python
Swarm Cloud        | 11,283         | REST: 2,610 / -; RPC: 4,614 / 21,574         | 25 | 36% C, 19% Java, 16% Javascript, 14% [Link], 13% C++, 2% Python
Swarm Edge         | 13,876         | REST: 4,757 / -                              | 21 | 29% C, 25% Javascript, 16% Java, 16% [Link], 11% C++, 3% Python
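The gap between the "Handwritten" and "Autogen" columns in Table 1 reflects the Thrift workflow: developers hand-write a compact interface definition, and the Thrift compiler generates the serialization code and client/server stubs for every target language. A minimal, hypothetical IDL in the style the suite's RPC services would use (struct, field, and service names here are illustrative, not taken from the actual sources):

```thrift
/* Hypothetical interface for a post-composition tier; names are
 * illustrative only. */
struct Post {
  1: i64          post_id;
  2: i64          user_id;
  3: string       text;
  4: list<string> media_urls;
}

service ComposePostService {
  // Returns the id of the newly stored post.
  i64 ComposePost(1: i64 user_id, 2: string text, 3: list<string> media_urls);
}
```

Running `thrift --gen cpp` (or `--gen java`, `--gen py`, etc.) over such a file emits the stubs counted in the "Autogen" column, which is why a few thousand handwritten lines expand into tens of thousands of generated ones.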

switch, and other OS overheads in microservices [78], while Ueda et al. [79] show the impact of compute resource allocation, application framework, and container configuration on the performance and scalability of several microservices. DeathStarBench differentiates from these studies by focusing on large-scale applications with tens of unique microservices, allowing us to study effects that only emerge at large scale, such as network contention and cascading QoS violations due to dependencies between tiers, as well as by including diverse applications that span social networks, media and e-commerce services, and applications running on swarms of edge devices.

3 The DeathStarBench Suite

We first describe the suite's design principles, and then present the architecture and functionality of each end-to-end service.

3.1 Design Principles

DeathStarBench adheres to the following design principles:

• Representativeness: The suite is built using popular open-source applications deployed by cloud providers, such as NGINX [13], memcached [39], MongoDB [12], RabbitMQ [15], MySQL, Apache http server, ardrone-autonomy [2, 5], and the Sockshop microservices by Weave [16]. Most new code corresponds to interfaces between the services, using Apache Thrift [1], gRPC [9], or http requests.

• End-to-end operation: Open-source cloud services, such as memcached, can function as components of a larger service, but do not capture the impact of inter-service dependencies on end-to-end performance. DeathStarBench instead implements the full functionality of a service from the moment a request is generated at the client until it reaches the service's backend and/or returns to the client.

• Heterogeneity: The software heterogeneity is both a challenge and opportunity with microservices, as different languages mean different bottlenecks, synchronization primitives, levels of indirection, and development effort. The suite uses applications in low- and high-level, managed and unmanaged languages including C/C++, Java, Javascript, [Link], Python, html, Ruby, Go, and Scala.

• Modularity: We follow Conway's Law [4], i.e., the fact that the software architecture of a service follows the architecture of the company that built it, in the design of the end-to-end applications, to avoid excessive two-way communication between any two dependent microservices, and to ensure they are single-concerned and loosely-coupled.

• Reconfigurability: Easily updating components of a larger service is one of the main advantages of microservices. Our RPC/HTTP API allows swapping out microservices for alternate versions, with small changes to existing components.

Table 1 shows the developed LoCs per service, and the LoCs for the communication protocol; hand-written, and auto-generated by Thrift, where applicable. The majority of new code for the Social Network, Media, E-commerce, and Banking services goes towards the cross-microservice API, as well as a few microservices for which no open-source framework existed, e.g., assigning ratings to movies. For the Swarm application, we show code breakdown for two versions; one where the majority of computation happens in a backend cloud (Swarm Cloud), and one where it happens locally on the edge devices (Swarm Edge). We also show the number of unique microservices for each application, and the breakdown per programming language. Unless otherwise noted, all microservices run in Docker containers.

3.2 Social Network

Scope: The end-to-end service implements a broadcast-style social network with uni-directional follow relationships.

Functionality: Fig. 4 shows the architecture of the end-to-end service. Users (client) send requests over http, which first reach a load balancer, implemented with nginx. Once a


Figure 4. The architecture (microservices dependency graph) of Social Network.

Figure 5. The architecture of the Media Service for reviewing, renting, and streaming movies.

specific webserver is selected, also in nginx, the latter uses a php-fpm module to talk to the microservices responsible for composing and displaying posts, as well as microservices for advertisements, search engines, etc. All messages downstream of php-fpm are Apache Thrift RPCs [1]. Users can create posts embedded with text, media, links, and tags to other users. Their posts are then broadcasted to all their followers. Users can also read, favorite, and repost posts, as well as reply publicly, or send a direct message to another user. The application also includes machine learning plugins, such as ads and user recommender engines [22, 23, 53, 83], a search service using Xapian [51], and microservices to record and display user statistics, e.g., number of followers, and to allow users to follow, unfollow, or block other accounts. The service's backend uses memcached for caching, and MongoDB for persistent storage for posts, profiles, media, and recommendations. Finally, the service is instrumented with a distributed tracing system (Sec. 3.7), which records the latency of each network request and per-microservice processing; traces are recorded in a centralized database. The service is broadly deployed at our institution, currently servicing several hundred users. We use this deployment to quantify the tail at scale effects of microservices in Section 8.

3.3 Media Service

Scope: The application implements an end-to-end service for browsing movie information, as well as reviewing, rating, renting, and streaming movies [18, 19].

Functionality: Fig. 5 shows the architecture of the end-to-end service. As with the social network, a client request hits the load balancer, which distributes requests among multiple nginx webservers. Users can search and browse information about movies, including their plot, photos, videos, cast, and review information, as well as insert new reviews in the system for a specific movie by logging into their account. Users can also select to rent a movie, which involves a payment authentication module to verify that the user has enough funds, and a video streaming module using nginx-hls, a production nginx module for HTTP live streaming. The actual movie files are stored in NFS, to avoid the latency and complexity of accessing chunked records from non-relational databases, while movie reviews are kept in memcached and MongoDB instances. Movie information is maintained in a sharded and replicated MySQL database. The application also includes movie and advertisement recommenders, as well as a couple of auxiliary services for maintenance and service discovery, which are not shown in the figure. We are similarly deploying Media Service as a hosting site for project demos at Cornell, which members of the community can browse and review.

3.4 E-Commerce Service

Scope: The service implements an e-commerce site for clothing. The design draws inspiration from, and uses several components of, the open-source Sockshop application [16].

Functionality: Fig. 6 shows the architecture of the end-to-end service. The application front-end in this case is a [Link] service. Clients can use the service to browse the inventory using catalogue, a Go microservice that mines the back-end memcached and MongoDB instances holding information about products. Users can also place orders (Go) by adding items to their cart (Java). After they log in (Go) to their account, they can select shipping options (Java), process their payment (Go), and obtain an invoice (Java) for their order. Orders are serialized and committed using QueueMaster (Go). Finally, the service includes a recommender engine for suggested products, and microservices for creating an item wishlist (Java), and displaying current discounts.

3.5 Banking System

Scope: The service implements a secure banking system, which users leverage to process payments, request loans, or balance their credit card.

Figure 6. The architecture of the E-commerce service.

Figure 7. The architecture of the Banking end-to-end service.
Figure 8. The Swarm service running (a) on edge devices, and (b) on the cloud. (c) Local drone swarm executing the service.

Functionality: Users interface with a [Link] front-end, similar to the one in the E-commerce service, to login to their account, search information about the bank, or contact a representative. Once logged in, a user can process a payment from their account, pay their credit card or request a new one, browse information about loans or request one, and obtain information about wealth management options. Most microservices are written in Java and Javascript. The back-end databases consist of in-memory memcached, and persistent MongoDB instances. The service also has a relational database (BankInfoDB) that includes information about the bank, its services, and representatives.

3.6 Swarm Coordination

Scope: Finally, we explore a different execution environment for microservices, where applications run both on the cloud and on edge devices. The service coordinates the routing of a swarm of programmable drones, which perform image recognition and obstacle avoidance.

Functionality: We explore two versions of this service. In the first (Fig. 8a), the majority of the computation happens on the drones, including the motion planning, image recognition, and obstacle avoidance, with the cloud only constructing the initial route per-drone (Java service ConstructRoute), and holding persistent copies of sensor data. This architecture avoids the high network latency between cloud and edge, however, it is limited by the on-board resources. The Controller and MotionController are implemented in Javascript, while ImageRecognition uses jimp, a [Link] library for image recognition [11], and ObstacleAvoidance is in C++. Services on the drones run natively, and communicate with each other over IPC, while the cloud and drones communicate over http to avoid installing the heavy dependencies of Thrift on the edge devices.

In the second version (Fig. 8b), the cloud is responsible for most of the computation. It performs motion control, image recognition, and obstacle avoidance for all drones, using the ardrone-autonomy [2], and Cylon [5] libraries, in OpenCV and Javascript respectively. The edge devices are only responsible for collecting sensor data and transmitting them to the cloud, as well as recording some diagnostics using a local [Link] logging service. In this case, almost every action suffers the cloud-edge network latency, although services benefit from the additional cloud resources. We use 24 programmable Parrot AR2.0 drones (a subset is seen in Fig. 8c), together with a backend cluster of 20 two-socket,

Figure 9. Throughput-tail latency for the Swarm service when execution happens at the edge versus the cloud (series: Edge/Cloud × Image Recognition, Obstacle Avoidance).

Figure 10. Cycle breakdown (Front-end, Bad Speculation, Back-end, Retiring) and IPC for the Social Network and E-commerce services.
40-core servers. Drones communicate with each other and
the cluster over a wireless router. existing post, prepend to it, and then propagate the message
across the user’s followers’ timelines.
3.7 Methodological Challenges of Microservices In E-commerce, on the other hand, placing an order, which
A major challenge with microservices is that one cannot includes adding an item to the cart, logging in to the account,
simply rely on the client to report performance, as with tra- confirming payment, and selecting shipping, takes 1-2 orders
ditional client-server applications. Resolving performance of magnitude longer than browsing the eshop’s catalogue.
issues requires determining which microservice(s) is the cul- In reality, placing an order requires interaction with the end
prit of a QoS violation, which typically happens through user; in our case we automate the client’s decisions so they in-
distributed tracing. We developed and deployed a distributed cur zero delay, making latency server-dominated. The trends
tracing system that records per-microservice latencies at across query types are similar for the Media and Banking
RPC granularity using the Thrift timing interface. RPCs or services, with processing payments, either to rent a movie,
REST requests are timestamped upon arrival and departure or to perform a transaction in a bank account, dominating
from each microservice by the tracing module, and data is latency and defining each service’s saturation point.
accumulated by the Trace Collector, implemented simi- Finally, in Fig. 9, we compare the performance of the
larly to the Zipkin Collector [17], and stored in a centralized IoT application when computation happens at the edge ver-
Cassandra database. We additionally track the time spent sus the cloud. Since drones have to communicate with a
processing network requests, as opposed to application com- wireless router over a distance of several tens of meters,
putation using a similar methodology to [58]. We verify that latencies are significantly higher than for the cloud-only
the overhead from tracing is negligible, less than 0.1% on services. When processing happens in the cloud, latency
end-to-end latency in all cases, which is tolerable for such at low load is higher, penalized by the long network delay.
systems [26, 72, 76]. As load increases however, edge devices quickly become
oversubscribed due to the limited on-board resources, with
3.8 Provisioning & Query Diversity processing on the cloud achieving 7.8x higher throughput
Before characterizing the architectural behavior of microser- for the same tail latency, or 20x lower latency for the same
vices, we provision the end-to-end applications to ensure throughput. Obstacle avoidance shows a different trade-off,
that microservices are used in a balanced way, and that no since it is less compute-intensive, and more latency-critical.
single microservice introduces early bottlenecks due to re- Offloading obstacle avoidance to the cloud at low load can
source saturation. To do so, we start with a fair resource have catastrophic consequences if route adjustment is de-
allocation for all microservices of an end-to-end workload, layed, which highlights the importance of latency-aware
and upsize saturated microservices until all tiers saturate at resource management between cloud and edge, especially
about the same load. The ratio of resources between tiers for safety-critical computation.
varies significantly across end-to-end services, highlighting
the need for application-aware resource management. 4 Architectural Implications
Different query types also achieve different performance Methodology: We first evaluate the end-to-end services
in each service. For example, composePost requests in the on a local cluster with 20 two-socket 40-core Intel Xeon
Social Network vary in the media they embed in a message, servers (E2699-v4 and E5-2660 v3) with 128-256GB memory
ranging from text-only messages, to posts including image each, connected to a 10GBps ToR switch with 10Gbe NICs.
and video files (we keep videos within a few MBs, similar All servers are running Ubuntu 16.04, and unless otherwise
to the allowable video sizes in production social networks noted power management and turbo boosting are turned off.
like Twitter). Reposting a post incurs the longest latency Cycles breakdown and IPC: We use Intel vTune [10] to
across query types for Social Network, as it must first read an break down the cycles, and identify bottlenecks. Fig. 10

9
Session: Cloud I ASPLOS’19, April 13–17, 2019, Providence, RI, USA

80 Social Network 70 E-Commerce 1000 NGINX Memcached MongoDB Xapian Recommender


70 60 1200

Frequency (MHz)
1400
60 50
L1i MPKI

L1i MPKI
1600
50
40 1800
40
30 2000
30
20 20 2200
2400
10 10 0 100 200 300 400 0 100 200 300 400 0 100 200 300 400 0 100 200 300 400 0 100 200 300 400
0 0 QPS QPS QPS QPS QPS
r g

c ra t

a e
m q nvo nt
un ima xt

M o-E b

M o-Edb
te x
iq ge
ur use eID

en o
w ea log r
m rite dP in

d- go d
on nd
ith

l nd
or ogin
ar s
w ca h

m g t
sh meue
paipp nd
i eg

En on ch er
d- goed

ol d
ith
En on cheh

re cataish rt
m vi ten

em G os

co lo lis
de
ho a
in

seder

em M ic
t d
m de

ymin

on n
m a p

m ca st
lS rT

ol

e
Social Network Media Service E-commerce Banking System Swarm-Cloud
ng

nt
1000

fro

t
1200
r

Frequency (MHz)
co

1400
re

1600
1800

Figure 11. L1-i misses in Social Network and E-commerce.


2000
2200
2400
0 100 200 300 400 0 100 200 300 400 0 100 200 300 400 0 100 200 300 400 0 20 40 60 80
QPS QPS QPS QPS QPS

0 1 2
shows the IPC and cycles for each microservice in the Social Network and E-commerce services. We omit the figures for the other services, however the observations are similar.
Across all services a large fraction of cycles, often the majority, is spent in the processor front-end. Front-end stalls occur for several reasons, including long memory accesses and i-cache misses. This is consistent with studies on traditional cloud applications [37, 50], although to a lesser extent for microservices than for monolithic services (memcached, mongodb), given their smaller code footprint. The majority of front-end stalls are due to fetch, while branch mispredictions account for a smaller fraction of stalls for microservices than for other interactive applications, either cloud or IoT [37, 88]. Only a small fraction of total cycles goes towards committing instructions (21% on average for Social Network), denoting that current systems are poorly provisioned for microservices-based applications.
E-commerce includes a few microservices that go against this trend, with high IPC and high percentage of retired instructions, such as Search. Search (xapian [51]) is already optimized for memory locality, and has a relatively small codebase, which explains the fewer front-end stalls. The same applies for simple microservices, such as the wishlist, for which i-cache misses are practically negligible. E-commerce also includes a recommender engine, whose IPC is extremely low; this is again in agreement with studies on the architectural behavior of ML applications [44]. The challenge with microservices is that although individual application components may be well understood, the structure of the end-to-end dependency graph defines how individual services affect the overall performance. For both services, we also show the cycles breakdown and IPC for corresponding applications with the same end-to-end functionality from the user's perspective, but built as monoliths. In both cases, monoliths are developed in Java, and include all application functionality, except for the backend databases (in memcached and MongoDB), in a single binary. The cycles breakdown is not drastically different for monoliths compared to microservices, although they experience slightly higher percentages of committed instructions, due to reduced front-end stalls, as they are less likely to wait for network requests to complete. IPC is also similar to microservices, and consistent with previous studies on cloud services [37, 51].

Figure 12. Tail latency with increasing load and decreasing frequency (RAPL) for traditional monolithic cloud applications, and the five end-to-end DeathStarBench services. Lighter colors (yellow) denote QoS violations.

I-cache pressure: Prior work has characterized the high pressure cloud applications put on the instruction caches [37, 52]. Since microservices decompose what would be one large binary to many small, loosely-connected services, we examine whether previous results on i-cache pressure still hold. Fig. 11 shows the MPKI of each microservice for the Social Network and E-commerce applications. We also include the back-end caching and database layers, as well as the corresponding L1i MPKI for the monolithic implementations.
First, the i-cache pressure of nginx, memcached, MongoDB, and especially the monoliths remains high, consistent with prior work [37, 52, 88]. The i-cache pressure of the remaining microservices though is considerably lower, especially for E-commerce, an expected observation given the microservices' small code footprints. Since [Link] applications outside the context of microservices do not have low i-cache miss rates [88], we conclude that it is the simplicity of microservices which results in better i-cache locality. Most L1i misses, especially in the Social Network, happen in the kernel, and are caused by Thrift. We also examined the LLC and D-TLB misses, and found them considerably lower than for traditional cloud applications, which is consistent with the push for microservices to be mostly stateless.
Brawny vs. wimpy cores: There has been a lot of work on whether small servers can replace high-end platforms in the cloud [25, 46–48]. Despite the power benefits of simple cores, interactive services still achieve better latency in servers that optimize for single-thread performance. Microservices offer an appealing target for simple cores, given the small amount of computation per microservice. We evaluate low-power machines in two ways. First, we use RAPL on our local cluster to reduce the frequency at which all microservices run. Fig. 12 (top row) shows the change in tail latency as load increases, and as the operating frequency decreases for five popular, open-source single-tier interactive services: nginx,

memcached, MongoDB, Xapian, and Recommender. We compare these against the five end-to-end services (bottom row).
As expected, most interactive services are sensitive to frequency scaling. Among the monolithic workloads, MongoDB is the only one that can tolerate almost minimum frequency at maximum load, due to it being I/O-bound. The other four single-tier services experience increased latency as frequency drops, with Xapian being the most sensitive [51], followed by nginx and memcached. However, looking at the same study for the microservices reveals that, despite the higher tail latency of the end-to-end service, microservices are much more sensitive to poor single-thread performance than traditional cloud applications. Although initially counterintuitive, this result is not surprising, given the fact that each individual microservice must meet much stricter tail latency constraints compared to an end-to-end monolith, putting more pressure on performance predictability. Out of the five end-to-end services (we omit Swarm-Edge, since compute happens on the edge devices), the Social Network and E-commerce are most sensitive to low frequency, while the Swarm service is the least sensitive, primarily because it is bound by the cloud-edge communication latency, as opposed to compute speed.
Apart from frequency scaling, there are platforms designed with low-power cores to begin with. We also evaluate the end-to-end services on two Cavium ThunderX boards (2 sockets, 48 in-order cores per socket, 1.8GHz each, and a 16-way shared 16MB LLC) [25]. The boards are connected on the same ToR switch as the rest of our cluster, and their memory and network subsystems are the same as the other servers. Fig. 13 shows the throughput at the saturation point for each application on the two platforms. We also show the performance of the Xeon server when equalizing its frequency to the Cavium board. Although ThunderX is able to meet the end-to-end QoS target at low load, all five applications saturate much earlier than on the high-end server. This is especially the case in Social Network and Media Service, because of their stricter latency requirements, and E-commerce, because it is more compute intensive. As with power management, Swarm does not suffer as much, because it is network-bound. Running the Xeon server at 1.8GHz, although worse than its performance at the nominal frequency, still outperforms the Cavium SoC considerably. Even though low power machines degrade performance in this case, they can still be used for microservices off the critical path, or those insensitive to frequency scaling.

Figure 13. Throughput-tail latency on an Intel Xeon and a Cavium ThunderX server for all end-to-end services.

5 OS & Networking Implications
We now examine the role of operating systems and networking under the new microservices model.
OS vs. user-level breakdown: Fig. 14 shows the breakdown of cycles (C) and instructions (I) to kernel, user, and libraries for each of the end-to-end services. For all applications, and especially Social Network and Media Service, a large fraction of execution is at kernel mode, skewed by the use of memcached for in-memory caching [57], and the high network traffic, with an almost equal fraction going towards libraries like libc, libgcc, libstdc, and libpthread. The breakdown is less skewed for E-commerce and Banking, whose microservices are more computationally intensive, and spend more time in user mode, while Swarm, both in its cloud and especially edge configurations, spends almost half of the time in libraries.

Figure 14. Time in kernel mode, user mode, and libraries for each service.

The large number of cycles in the kernel is not surprising, given that applications like memcached and MongoDB spend most of their execution time in the kernel to handle interrupts, process TCP packets, and activate and schedule idling interactive services [57]. The large number of library cycles is also intuitive, given that microservices optimize for speed of development, and hence leverage a lot of existing libraries, as opposed to reimplementing the functionality from scratch. The overhead of general-purpose Linux has motivated a lot of simpler specialized kernels, such as Unikernel [63], which trade off compatibility for improved performance. Similar OS designs are also applicable to single-concerned microservices.
Computation:communication ratio: Fig. 15a shows the time spent processing network requests compared to application computation at low and high load for the microservices in Social Network. Fig. 15b shows the fraction of tail latency spent processing RPC requests for the remaining end-to-end services. At low load, RPC processing corresponds to 5-75% of execution time across the Social Network's microservices, and 18% of end-to-end tail latency. This is caused by several microservices being too simple to involve considerable processing. In comparison, network processing accounts for a lower fraction of latency in E-commerce and Banking, primarily because their microservices are more computationally intensive. Finally, network processing accounts for over 30% of tail latency in both Swarm settings, even at low load.
At high load, network processing becomes a much more pronounced factor of tail latency for all end-to-end services, except for E-commerce and Banking, as long queues build up in the NICs. This has a significant impact on tail latency,

with the Social Network experiencing a 3.2× increase in end-to-end tail latency. The large impact of network processing occurs regardless of whether microservices communicate over RPCs (Social Network, Media Service, Banking), or over HTTP (E-commerce, Swarm-Edge), although RPCs introduce considerably lower latencies at low load than HTTP. Finally, Fig. 15a also shows the time the monolithic Social Network application spends processing network requests. Both at low, and especially at high load, the difference is dramatic, albeit justified, since monoliths are deployed as single binaries, with the majority of the network traffic corresponding to client-server communication.

Figure 15. Time in application vs network processing for (a) microservices in Social Network, and (b) the other services.

Given the prominent role network processing has on tail latency, we now examine its potential for acceleration. We use a bump-in-the-wire setup, seen in Fig. 16a, and similar to the one in [38], to offload the entire TCP stack [54, 69, 70, 74, 75] on a Virtex 7 FPGA using Vivado HLS. The FPGA is placed between the NIC and the top of rack switch (ToR), and is connected to both with matching transceivers, acting as a filter on the network. We maintain the PCIe connection between the host and the FPGA for accelerating other services, such as the machine learning models in the recommender engines, during periods of low network load. Fig. 16b shows the speedup from acceleration on network processing latency alone, and on the end-to-end latency of each of the services. Network processing latency improves by 10-68x over native TCP, while end-to-end tail latency improves by 43% and up to 2.2x. For interactive, latency-critical services, where even a small improvement in tail latency is significant, network acceleration provides a major boost in performance.

Figure 16. (a) Overview of the FPGA configuration for RPC acceleration, and (b) the performance benefits of acceleration in terms of network and end-to-end tail latency.

6 Cluster Management Implications
Microservices complicate cluster management, because dependencies between tiers can introduce backpressure effects, leading to system-wide hotspots [56, 59, 82, 85, 87]. Backpressure can additionally trick the cluster manager into penalizing or upsizing a saturated microservice, even though its saturation is the result of backpressure from another, potentially not-saturated service. Fig. 17 highlights this issue for a simplified two-tier application consisting of a webserver (nginx) and an in-memory caching key-value store (memcached). In case A, as the client issues read requests, nginx reaches saturation, causing its latency to increase rapidly, and long queues to form in its input. This is a straightforward case, which autoscaling systems can easily tackle by scaling out nginx, as seen in the figure at t = 14s and t = 35s.

Figure 17. Example of backpressure between microservices in a simple, two-tier application. Case A shows a typical hotspot that autoscalers can easily address, while Case B shows that a seemingly negligible bottleneck in memcached can cause the front-end NGINX service to saturate.

Case B, on the other hand, highlights the challenges of backpressure. When using HTTP1, requests within a single connection are blocking, i.e., there can only be one outstanding request per connection across tiers. Therefore, even though memcached itself is not saturated, it causes long queues of outstanding requests to form ahead of nginx, which in turn cause it to saturate. Current cluster managers cannot easily address this case, as a utilization-based autoscaling scheme would scale out nginx, which is busy waiting and appears saturated. As seen in the figure, not only does this not solve the problem, but it can potentially make it worse, by admitting even more traffic into the system.

Figure 18. Microservices graphs for three real production cloud providers (Netflix, Twitter, and Amazon) [6, 18, 19]. We also show these dependencies for Social Network.

Even without the connection blocking in HTTP1,

backpressure still occurs, as multi-tier applications are not perfect pipelines where tiers operate entirely independently.
Unfortunately real-world cloud applications are much more complex than this simple example suggests. Fig. 18 shows the microservices dependency graphs for three major cloud service providers, and for one of our applications (Social Network). The perimeter of the circle (or sphere surface) shows the different microservices, and edges show dependencies between them. Such dependencies are difficult for developers or users to describe, and furthermore, they change frequently, as old microservices are swapped out and replaced by newer services.
Fig. 19 shows the impact of cascading QoS violations in the Social Network service. Darker colors show tail latency closer to nominal operation for a given microservice in Fig. 19a, and low utilization in Fig. 19b. Brighter colors signify high per-microservice tail latency and high CPU utilization. Microservices are ordered based on the service architecture, from the back-end services at the top, to the front-end at the bottom. Fig. 19a shows that once the back-end service at the top experiences high tail latency, the hotspot propagates to its upstream services, and all the way to the front-end. Utilization in this case can be misleading. Even though the saturated back-end services have high utilization in Fig. 19b, microservices in the middle of the figure also have even higher utilization, without this translating to QoS violations.

Figure 19. Cascading QoS violations in Social Network compared to per-microservice CPU utilization.

Conversely, there are microservices with relatively low utilization and degraded performance, for example, due to waiting on a blocking/synchronous request from another, saturated tier. This highlights the need for cluster managers that account for the impact dependencies between microservices have on end-to-end performance when allocating resources.
Finally, the fact that hotspots propagate between tiers means that once microservices experience a QoS violation, they need longer to recover than traditional monolithic applications, even in the presence of autoscaling mechanisms, which most cloud providers employ. Fig. 20 shows such a case for Social Network implemented with microservices, and as a monolith in Java. In both cases the QoS violation is detected at the same time. However, while the cluster manager can simply instantiate new copies of the monolith and rebalance the load, autoscaling takes longer to improve performance. This is because, as shown in Fig. 20b, the autoscaler simply upsizes the resources of saturated services, seen by the progressively darker colors of highly-utilized microservices. However, services with the highest utilization are not necessarily the culprits of a QoS violation [61], taking the system much longer to identify the correct source behind the degraded performance and upsizing it. As a result, by the time the culprit is identified, long queues have already built up which take considerable time to drain.

Figure 20. (a) Microservices taking longer than monoliths to recover from a QoS violation, even (b) in the presence of autoscaling mechanisms.

7 Application & Programming Framework Implications
Latency breakdown per microservice: We first examine whether the end-to-end services experience imbalance across tiers, with some microservices being responsible for a disproportionate amount of computation or end-to-end latency, or being prone to creating hotspots. We examine each service at low and high load and obtain the per-microservice latency using our distributed tracing framework, and confirm it with Intel's vTune. Both for the Social Network and Media Service, latency at low load is dominated by the front-end (nginx), while the rest of the microservices are almost evenly distributed. MongoDB is the only exception, accounting for 8.5% and 10.3% of end-to-end latency respectively.
This picture changes at high load. While the front-end still contributes considerably to latency, overall performance is now limited by the back-end databases, and the microservices that manage them, e.g., writeGraph. The Ecommerce and Banking services experience similar fluctuations across load levels, and are additionally impacted by the fact that several of their services are compute intensive, and written in high-level languages, like [Link] and Go. This affects execution time, with orders, catalogue, and payment accounting for the majority of end-to-end latency for Ecommerce, and payments and authentication for Banking. The back-end databases in this case contribute less to execution time, showing that the choice of programming language affects how hotspots evolve in the system. queueMaster also experiences high latency in E-commerce, as it uses synchronization to ensure that orders are serialized, processed, and committed in order, which constrains its scalability at high load.
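The queueMaster behavior described above, where orders are committed strictly in arrival order by a single consumer, can be sketched as follows. This is a minimal illustration with hypothetical names, not the benchmark's actual implementation: however many front-end threads submit concurrently, commits proceed one at a time, which is exactly what bounds scalability at high load.

```python
import threading
import queue

class OrderSerializer:
    """Hypothetical queueMaster-style serializer: a single worker
    drains a FIFO queue, so orders commit one by one, in order."""

    def __init__(self):
        self._q = queue.Queue()
        self._committed = []
        threading.Thread(target=self._drain, daemon=True).start()

    def submit(self, order_id):
        # Producers never block each other on submission...
        self._q.put(order_id)

    def _drain(self):
        while True:
            order_id = self._q.get()          # ...but commits are serialized:
            self._committed.append(order_id)  # one at a time, FIFO order
            self._q.task_done()

    def flush(self):
        self._q.join()                        # wait until every order committed
        return list(self._committed)

s = OrderSerializer()
for i in range(100):
    s.submit(i)
print(s.flush() == list(range(100)))  # True: arrival order is preserved
```

Adding more producer threads would not raise commit throughput here; only the single drain loop commits, mirroring the synchronization bottleneck described above.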

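The per-microservice latency breakdowns above come from RPC-level spans recorded by the tracing framework. A minimal sketch of attributing exclusive latency to each tier from such spans follows; the flat record format and service names are hypothetical, and it assumes perfectly nested, synchronous spans on a single clock:

```python
# Toy spans: (service, start_ms, end_ms), with every downstream
# call strictly nested inside its caller's span (assumed format).
spans = [
    ("nginx",       0.0, 10.0),  # front-end span covers the whole request
    ("composePost", 1.0,  9.0),
    ("mongodb",     2.0,  6.5),
]

def exclusive_latency(spans):
    """For each tier: its own span duration minus the time spent in
    the tiers it called directly (nested-span model)."""
    result = {}
    for svc, s, e in spans:
        # spans strictly contained in this one
        inner = [(s2, e2) for _, s2, e2 in spans
                 if (s2, e2) != (s, e) and s2 >= s and e2 <= e]
        # direct children: contained spans not nested inside another
        direct = [(s2, e2) for (s2, e2) in inner
                  if not any((s3, e3) != (s2, e2) and s3 <= s2 and e2 <= e3
                             for (s3, e3) in inner)]
        result[svc] = (e - s) - sum(e2 - s2 for (s2, e2) in direct)
    return result

# Here the back-end dominates: mongodb owns 4.5ms of the 10ms request,
# while nginx itself accounts for only 2ms outside its callees.
print(exclusive_latency(spans))
```

Summing the exclusive times recovers the end-to-end latency, which is what makes this attribution useful for pinpointing which tier a QoS violation originates in.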
13
Session: Cloud I ASPLOS’19, April 13–17, 2019, Providence, RI, USA

minutes. The margins of box plots show the 25t h and 75t h la-
$2.08
Amazon EC2
Tail Latency (ms)

$3.65
AWS Lambda (S3)
2
AWS Lambda (mem) $4.56
$2.19 $14.8
tency percentiles, while the whiskers show the 5t h and 95t h .
10
In Lambda, we show performance and cost both for the de-
$3.16 $4.02
$2.85 $6.87
$37.6
$21.6

1
$28.8
$3.93 $24.1 $5.02
fault persistent storage (S3), and for a configuration that uses
10
the memory of four additional EC2 instances to maintain
Social
Network
Media
Service
Ecommerce Banking
System
Swarm
Cloud intermediate state passed through dependent microservices.
25 500 Latency is considerably higher for Lambda when using
S3, primarily due to the overhead and rate limiting of the re-
EC2
Tail Latency (ms)

Input Load (QPS)


20 400
Lambda
15 300
mote persistent storage. This occurs even though the amount
10 200
of data transfered between microservices is small, to ad-
here to the design principle that microservices should be
5 100
0 0
0 50 100 150
Time (s)
200 250 300
mostly stateless [18]. The majority of this overhead disap-
pears when using remote memory to pass state between
Figure 21. Performance and cost for the five services on dependent serverless functions. Even in this case though,
Amazon EC2 and AWS Lambda (top). Tail latency for Social performance variability is higher in Lambda, as functions
Network under a diurnal load pattern (bottom). can be placed anywhere in the datacenter, incurring variable
network latencies, and suffering interference from external
functions co-scheduled on the same physical machines (EC2
Finally, the Swarm coordination service experiences dif- instances are dedicated to our services). Note that even in
ferent trade-offs when running on the cloud compared to the EC2 scenario, dependent microservices are placed on
the edge devices. While imageRecognition dominates latency different physical machines to ensure a fair comparison in
regardless of where the microservice is running, its impact terms of network traffic. On the other hand, cost is almost an
on tail latency is more severe when running at the resource- order of magnitude lower for Lambda, especially when using
limited edge, to the point of preventing the motion controller S3, as resources are only charged on a per-request basis.
from engaging, due to insufficient resources. The bottom of Fig. 21 highlights the ability of serverless
This shows not only that bottlenecks vary across end-to-end services, despite individual microservices being the same or similar, but also that these bottlenecks change with load, putting more pressure on dynamic and agile management.

Serverless frameworks: Microservices are often used in the context of serverless programming frameworks, i.e., frameworks where the application and data are managed by the cloud provider, and the user simply launches short-lived "functions" and is charged on a per-request basis [3]. Serverless is well-suited for applications with intermittent activity, where maintaining long-running instances is cost inefficient. Serverless additionally targets embarrassingly parallel services, which benefit from a massive amount of resources for a brief period of time. At the same time, serverless adds an extra level of indirection, as applications have to be instrumented (or rewritten) to interface with the serverless framework [8, 14]. Additionally, since serverless functions are ephemeral, data has to be stored in persistent storage for subsequent functions to operate on it. On AWS Lambda the output of functions is stored in S3, which can introduce significant overheads compared to in-memory computation.

Fig. 21 (top) shows the performance and cost of each end-to-end service on traditional containers on Amazon EC2 versus AWS Lambda functions. Each microservice is instrumented to interface with Lambda's API. For a number of microservices written in languages that are not currently supported by Lambda, we also had to reimplement the microservice logic. In the case of EC2, each service uses between 20-64 m5.12xlarge instances. We run each service for 10 […] to elastically scale resources on demand. The input load is real user traffic in Social Network, which follows a diurnal pattern. In the interest of cost, we have compressed the load pattern to a shorter period of time and replayed it using our open-loop workload generator. Even though EC2 experiences lower tail latency than Lambda during low-load periods, consistent with the findings above, when load increases, Lambda adjusts resources to user demand faster than EC2. This is because the increased number of requests translates to more Lambda functions without requiring the user to intervene. In comparison, on EC2 we use an autoscaling mechanism that examines utilization and scales allocations by requesting extra instances when utilization exceeds a pre-determined threshold (70% in this case, consistent with the EC2 default autoscaler [20]). This has a negative impact on latency, since the system waits for load to increase substantially before employing additional resources, and initializing new resources is not instantaneous. For microservices to reach the potential serverless offers, they need to remain mostly stateless, and leverage in-memory primitives to pass data between dependent functions.

8 Tail At Scale Implications

We now focus on the Social Network service to study the tail-at-scale effects of microservices, i.e., effects that occur because of the large scale of systems and applications [28]. The Social Network has several hundred registered users, and 165 active daily users on average.
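The intuition behind these tail-at-scale effects can be sketched with a simple probabilistic toy model (our illustration, not the paper's methodology; the 30-instance critical path below is an assumed value): if a fraction p of servers is slow, a request that must traverse k service instances on its critical path is degraded with probability 1 − (1 − p)^k, so deeper microservice graphs amplify rare slowdowns, while a monolith is only exposed once per request.

```python
def degraded_fraction(p_slow: float, critical_path_len: int) -> float:
    """Probability that at least one instance on a request's critical
    path lands on a slow server, assuming independent placement."""
    return 1.0 - (1.0 - p_slow) ** critical_path_len

if __name__ == "__main__":
    for p in (0.001, 0.01, 0.05):
        mono = degraded_fraction(p, 1)    # monolith: one instance per request
        micro = degraded_fraction(p, 30)  # microservices: assumed 30-instance critical path
        print(f"p_slow={p:.1%}: mono={mono:.2%} degraded, micro={micro:.2%} degraded")
```

Under these assumptions, 1% of slow servers degrades roughly a quarter of requests in the 30-instance case, matching the qualitative trend measured below: goodput collapses for microservices long before it does for the monolith.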

Session: Cloud I ASPLOS’19, April 13–17, 2019, Providence, RI, USA

The input load for this study is real user-generated traffic. To scale to larger clusters than our local infrastructure allows, we deploy the service on a dedicated EC2 cluster with 40 to 200 c5.18xlarge instances (72 vCPUs, 144GB RAM each).

Figure 22. (a) Cascading hotspots in the large-scale Social Network deployment, and tail at scale effects from (b) request skew, and (c) slow servers.

Large-scale cascading hotspots: Fig. 22a shows the performance impact of dependencies between microservices on 100 EC2 instances. Microservices on the y-axis are again ordered from the back-end at the top to the front-end at the bottom. While initially all microservices behave nominally, at t = 260s the middle tiers, and specifically composePost and readPost, become saturated due to a switch routing misconfiguration that overloaded one instance of each microservice, instead of load balancing requests across different instances. This in turn causes their downstream services to saturate, causing a similar waterfall pattern in per-tier latency to the one in Fig. 19. Towards the end of the sampled time (t > 500s) the back-end services also become saturated for a similar reason, causing microservices earlier in the critical path to saturate. This is especially evident for microservices in the middle of the y-axis (bright yellow), whose performance was already degraded by the previous QoS violation. To allow the system to recover in this case we employed rate limiting, which constrains the admitted user traffic until current hotspots dissipate. Even though rate limiting is effective, it affects user experience by dropping a fraction of requests.

Request skew: Load is rarely uniform in user-facing cloud services, with some users being responsible for a disproportionate amount of the generated load. Real traffic in the Social Network usually adheres to this principle, with a small fraction of users, around 5%, being responsible for more than 30% of the requests. To study request skew at its extreme, we additionally inject synthetic users that generate a much larger number of requests than typical users. Specifically, we vary skew from 0 to 99%, where skew is defined as [100 − u], with u the fraction of users initiating 90% of total requests; skew of 0% means a uniform request distribution. Fig. 22b shows the impact of skew on the max sustained load for which QoS is met. When skew=0%, the service achieves its max QPS under QoS for that cluster size (100 instances). As skew increases, goodput (throughput under QoS) quickly drops, and when less than 20% of users are responsible for the majority of requests, goodput is almost zero.

Impact of slow servers: Fig. 22c shows the impact a small number of slow servers has on overall QoS as cluster size increases. We purposely slow down a small fraction of servers by enabling aggressive power management, which we already saw is detrimental to performance (Sec. 4). For large clusters (>100 instances), when 1% or more of servers behave poorly, the goodput is almost zero, as these servers host at least one microservice on the critical path, degrading QoS. Even for small clusters (40 instances), a single slow server is the most the service can sustain and still achieve some QPS under QoS. Finally, we compare the impact of slow servers in clusters of equal size for the monolithic design of Social Network. In this case goodput is higher, even as cluster sizes grow, since a single slow server only affects the instance of the monolith hosted on it, while the other instances operate independently. The only exceptions are the back-end databases, which even for the monolith are shared across application instances, and sharded across machines. If one of the slow servers is hosting a database shard, all requests directed to that instance are degraded. In general, the more complex an application's microservices graph, the more impactful slow servers are, as the probability that a service on the critical path will be degraded increases.

9 Conclusions

We have presented DeathStarBench, an open-source suite for cloud and IoT microservices. The suite includes representative services, such as social networks, video streaming, e-commerce, and swarm control services. We use DeathStarBench to study the implications microservices have across the cloud system stack, from datacenter server design and hardware acceleration, to OS and networking overheads, and cluster management and programming framework design. We also quantify the tail-at-scale effects of microservices as clusters grow in size and services become more complex, and show that microservices put increased pressure on low tail latency and performance predictability.

DeathStarBench Release

The applications in DeathStarBench are publicly available at: [Link] under a GPL licence. We welcome feedback and suggestions, and hope that by releasing the benchmark suite publicly, we can encourage more work in this emerging field.

Acknowledgements

We sincerely thank Christos Kozyrakis, Daniel Sanchez, David Lo, as well as the academic and industrial users of the benchmark suite, and the anonymous reviewers for their feedback on earlier versions of this manuscript. This work was supported in part by NSF grant CNS-1422088, a Facebook Faculty Research Award, a John and Norma Balen Sesquicentennial Faculty Fellowship, and generous donations from Google Compute Engine, Windows Azure, and Amazon EC2.


References
[1] [n. d.]. Apache Thrift. [Link]
[2] [n. d.]. ardrone-autonomy. [Link]
[3] [n. d.]. AWS Lambda. [Link]
[4] [n. d.]. Conway's Law. [Link]
[5] [n. d.]. [Link]. [Link]
[6] [n. d.]. Decomposing Twitter: Adventures in Service-Oriented Architecture. [Link]
[7] [n. d.]. Finagle: An extensible RPC system for the JVM. [Link]
[8] [n. d.]. fission: Serverless Functions for Kubernetes. [Link]
[9] [n. d.]. gRPC: A high performance open-source universal RPC framework. [Link]
[10] [n. d.]. Intel VTune Amplifier. [Link]
[11] [n. d.]. jimp: An image processing library in [Link] with zero external dependencies. [Link]
[12] [n. d.]. mongoDB. [Link]
[13] [n. d.]. NGINX. [Link]
[14] [n. d.]. OpenLambda. [Link]
[15] [n. d.]. RabbitMQ. [Link]
[16] [n. d.]. SockShop: A Microservices Demo Application. [Link]
[17] [n. d.]. Zipkin. [Link]
[18] 2016. The Evolution of Microservices. [Link]
[19] Adrian Cockroft. [n. d.]. Microservices Workshop: Why, what, and how to get there. [Link]
[20] [n. d.]. AWS Autoscaling. [Link]
[21] Luiz Barroso and Urs Hoelzle. 2009. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. MC Publishers.
[22] Robert Bell, Yehuda Koren, and Chris Volinsky. 2007. The BellKor 2008 Solution to the Netflix Prize. Technical Report.
[23] Leon Bottou. [n. d.]. Large-Scale Machine Learning with Stochastic Gradient Descent. In Proceedings of the International Conference on Computational Statistics (COMPSTAT). Paris, France, 2010.
[24] Adrian M. Caulfield, Eric S. Chung, Andrew Putnam, Hari Angepat, Jeremy Fowers, Michael Haselman, Stephen Heil, Matt Humphrey, Puneet Kaur, Joo-Young Kim, Daniel Lo, Todd Massengill, Kalin Ovtcharov, Michael Papamichael, Lisa Woods, Sitaram Lanka, Derek Chiou, and Doug Burger. 2016. A Cloud-scale Acceleration Architecture. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-49). IEEE Press, Piscataway, NJ, USA, Article 7, 13 pages. [Link]
[25] Shuang Chen, Shay Galon, Christina Delimitrou, Srilatha Manne, and Jose F. Martinez. 2017. Workload Characterization of Interactive Cloud Services on Big and Small Server Platforms. In Proc. of IISWC.
[26] Michael Chow, David Meisner, Jason Flinn, Daniel Peek, and Thomas F. Wenisch. 2014. The Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI'14). USENIX Association, Berkeley, CA, USA, 217–231. [Link]
[27] Eric S. Chung, Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Adrian M. Caulfield, Todd Massengill, Ming Liu, Daniel Lo, Shlomi Alkalay, Michael Haselman, Maleen Abeydeera, Logan Adams, Hari Angepat, Christian Boehn, Derek Chiou, Oren Firestein, Alessandro Forin, Kang Su Gatlin, Mahdi Ghandi, Stephen Heil, Kyle Holohan, Ahmad El Husseini, Tamás Juhász, Kara Kagi, Ratna Kovvuri, Sitaram Lanka, Friedel van Megen, Dima Mukhortov, Prerak Patel, Brandon Perez, Amanda Rapsang, Steven K. Reinhardt, Bita Rouhani, Adam Sapek, Raja Seera, Sangeetha Shekar, Balaji Sridharan, Gabriel Weisz, Lisa Woods, Phillip Yi Xiao, Dan Zhang, Ritchie Zhao, and Doug Burger. 2018. Serving DNNs in Real Time at Datacenter Scale with Project Brainwave. IEEE Micro 38, 2 (2018), 8–20. [Link]
[28] Jeffrey Dean and Luiz Andre Barroso. [n. d.]. The Tail at Scale. In CACM, Vol. 56 No. 2, Pages 74-80.
[29] Christina Delimitrou and Christos Kozyrakis. [n. d.]. Paragon: QoS-Aware Scheduling for Heterogeneous Datacenters. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). Houston, TX, USA, 2013.
[30] Christina Delimitrou and Christos Kozyrakis. [n. d.]. QoS-Aware Scheduling in Heterogeneous Datacenters with Paragon. In ACM Transactions on Computer Systems (TOCS), Vol. 31 Issue 4. December 2013.
[31] Christina Delimitrou and Christos Kozyrakis. [n. d.]. Quality-of-Service-Aware Scheduling in Heterogeneous Datacenters with Paragon. In IEEE Micro Special Issue on Top Picks from the Computer Architecture Conferences. May/June 2014.
[32] Christina Delimitrou and Christos Kozyrakis. [n. d.]. Quasar: Resource-Efficient and QoS-Aware Cluster Management. In Proceedings of the Nineteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). Salt Lake City, UT, USA, 2014.
[33] Christina Delimitrou and Christos Kozyrakis. 2016. HCloud: Resource-Efficient Provisioning in Shared Cloud Systems. In Proceedings of the Twenty First International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[34] Christina Delimitrou and Christos Kozyrakis. 2017. Bolt: I Know What You Did Last Summer... In The Cloud. In Proceedings of the Twenty Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[35] Christina Delimitrou and Christos Kozyrakis. 2018. Amdahl's Law for Tail Latency. In Communications of the ACM (CACM).
[36] Christina Delimitrou, Daniel Sanchez, and Christos Kozyrakis. 2015. Tarcil: Reconciling Scheduling Speed and Quality in Large Shared Clusters. In Proceedings of the Sixth ACM Symposium on Cloud Computing (SOCC).
[37] Michael Ferdman, Almutaz Adileh, Onur Kocberber, Stavros Volos, Mohammad Alisafaee, Djordje Jevdjic, Cansu Kaynak, Adrian Daniel Popescu, Anastasia Ailamaki, and Babak Falsafi. [n. d.]. Clearing the Clouds: A Study of Emerging Scale-out Workloads on Modern Hardware. In Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). London, England, UK, 2012. [Link]
[38] Daniel Firestone, Andrew Putnam, Sambhrama Mundkur, Derek Chiou, Alireza Dabagh, Mike Andrewartha, Hari Angepat, Vivek Bhanu, Adrian Caulfield, Eric Chung, Harish Kumar Chandrappa, Somesh Chaturmohta, Matt Humphrey, Jack Lavier, Norman Lam, Fengfen Liu, Kalin Ovtcharov, Jitu Padhye, Gautham Popuri, Shachar Raindel, Tejas Sapre, Mark Shaw, Gabriel Silva, Madhan Sivakumar, Nisheeth Srivastava, Anshuman Verma, Qasim Zuhair, Deepak Bansal, Doug Burger, Kushagra Vaid, David A. Maltz, and Albert Greenberg. 2018. Azure Accelerated Networking: SmartNICs in the Public Cloud. In 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18). USENIX Association, Renton, WA, 51–66. [Link]
[39] Brad Fitzpatrick. [n. d.]. Distributed caching with memcached. In Linux Journal, Volume 2004, Issue 124, 2004.
[40] Jason Flinn. September 2012. Cyber Foraging: Bridging Mobile and Cloud Computing. Synthesis Lectures on Mobile and Pervasive Computing.


[41] Yu Gan and Christina Delimitrou. 2018. The Architectural Implications of Cloud Microservices. In Computer Architecture Letters (CAL), vol. 17, iss. 2.
[42] Vishal Gupta and Karsten Schwan. [n. d.]. Brawny vs. Wimpy: Evaluation and Analysis of Modern Workloads on Heterogeneous Processors. In Proceedings of IEEE International Symposium on Parallel & Distributed Processing (IPDPS). Boston, MA, 2013.
[43] Ragib Hasan, Md. Mahmud Hossain, and Rasib Khan. 2015. Aura: An IoT Based Cloud Infrastructure for Localized Mobile Computation Outsourcing. In 3rd IEEE International Conference on Mobile Cloud Computing, Services, and Engineering, MobileCloud. San Francisco, CA, 183–188. [Link]
[44] Johann Hauswald, Michael A. Laurenzano, Yunqi Zhang, Cheng Li, Austin Rovinski, Arjun Khurana, Ronald G. Dreslinski, Trevor Mudge, Vinicius Petrucci, Lingjia Tang, and Jason Mars. 2015. Sirius: An Open End-to-End Voice and Vision Personal Assistant and Its Implications for Future Warehouse Scale Computers. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '15). ACM, New York, NY, USA, 223–238. [Link]
[45] Ben Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, and Ion Stoica. [n. d.]. Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. In Proceedings of NSDI. Boston, MA, 2011.
[46] Urs Hölzle. [n. d.]. Brawny cores still beat wimpy cores, most of the time. In IEEE Micro. 2010.
[47] Vijay Janapa Reddi, Benjamin C. Lee, Trishul Chilimbi, and Kushagra Vaid. [n. d.]. Mobile Processors for Energy-Efficient Web Search. In ACM Transactions on Computer Systems, Vol. 29, No. 4, Article 9. 2011.
[48] Vijay Janapa Reddi, Benjamin C. Lee, Trishul Chilimbi, and Kushagra Vaid. 2010. Web Search Using Mobile Cores: Quantifying and Mitigating the Price of Efficiency. In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA '10). ACM, New York, NY, USA, 314–325. [Link]
[49] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. 2017. In-Datacenter Performance Analysis of a Tensor Processing Unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). ACM, New York, NY, USA, 1–12.
[50] Svilen Kanev, Juan Darago, Kim Hazelwood, Parthasarathy Ranganathan, Tipp Moseley, Gu-Yeon Wei, and David Brooks. 2015. Profiling a warehouse-scale computer. In ISCA '15 Proceedings of the 42nd Annual International Symposium on Computer Architecture. 158–169.
[51] Harshad Kasture and Daniel Sanchez. 2016. TailBench: A Benchmark Suite and Evaluation Methodology for Latency-Critical Applications. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC).
[52] Cansu Kaynak, Boris Grot, and Babak Falsafi. 2013. SHIFT: shared history instruction fetch for lean-core server processors. In The 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46). Davis, CA, 272–283. [Link]
[53] Krzysztof C. Kiwiel. [n. d.]. Convergence and efficiency of subgradient methods for quasiconvex minimization. In Mathematical Programming (Series A) (Berlin, Heidelberg: Springer) 90 (1): pp. 1-25, 2001.
[54] David Koeplinger, Raghu Prabhakar, Yaqi Zhang, Christina Delimitrou, Christos Kozyrakis, and Kunle Olukotun. 2016. Automatic Generation of Efficient Accelerators for Reconfigurable Hardware. In 43rd ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2016, Seoul, South Korea, June 18-22, 2016. 115–127. [Link]
[55] Nane Kratzke and Peter-Christian Quint. 2016. Ppbench. In Proceedings of the 6th International Conference on Cloud Computing and Services Science - Volume 1 and 2 (CLOSER 2016). SCITEPRESS - Science and Technology Publications, Lda, Portugal, 223–231.
[56] Chien-An Lai, Josh Kimball, Tao Zhu, Qingyang Wang, and Calton Pu. 2017. milliScope: A Fine-Grained Monitoring Framework for Performance Debugging of n-Tier Web Services. In 37th IEEE International Conference on Distributed Computing Systems, ICDCS 2017, Atlanta, GA, USA, June 5-8, 2017. 92–102.
[57] Jacob Leverich and Christos Kozyrakis. [n. d.]. Reconciling High Server Utilization and Sub-millisecond Quality-of-Service. In Proceedings of EuroSys. Amsterdam, The Netherlands, 2014.
[58] Jialin Li, Naveen Kr. Sharma, Dan R. K. Ports, and Steven D. Gribble. 2014. Tales of the Tail: Hardware, OS, and Application-level Sources of Tail Latency. In Proceedings of the ACM Symposium on Cloud Computing (SOCC '14). ACM, New York, NY, USA, Article 9, 14 pages.
[59] Jack Li, Qingyang Wang, Chien-An Lai, Junhee Park, Daisaku Yokoyama, and Calton Pu. 2014. The Impact of Software Resource Allocation on Consolidated n-Tier Applications. In 2014 IEEE 7th International Conference on Cloud Computing, Anchorage, AK, USA, June 27 - July 2, 2014. 320–327.
[60] Ching-Chi Lin, Pangfeng Liu, and Jan-Jan Wu. [n. d.]. Energy-Aware Virtual Machine Dynamic Provision and Scheduling for Cloud Computing. In Proceedings of the 2011 IEEE 4th International Conference on Cloud Computing (CLOUD). Washington, DC, USA, 2011. [Link]
[61] David Lo, Liqun Cheng, Rama Govindaraju, Luiz André Barroso, and Christos Kozyrakis. [n. d.]. Towards Energy Proportionality for Large-scale Latency-critical Workloads. In Proceedings of the 41st Annual International Symposium on Computer Architecture (ISCA). Minneapolis, MN, 2014.
[62] David Lo, Liqun Cheng, Rama Govindaraju, Parthasarathy Ranganathan, and Christos Kozyrakis. [n. d.]. Heracles: Improving Resource Efficiency at Scale. In Proc. of the 42nd Annual International Symposium on Computer Architecture (ISCA). Portland, OR, 2015.
[63] Anil Madhavapeddy, Richard Mortier, Charalampos Rotsos, David Scott, Balraj Singh, Thomas Gazagnaire, Steven Smith, Steven Hand, and Jon Crowcroft. 2013. Unikernels: Library Operating Systems for the Cloud. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '13). ACM, New York, NY, USA, 461–472. [Link]
[64] Jason Mars and Lingjia Tang. [n. d.]. Whare-map: heterogeneity in "homogeneous" warehouse-scale computers. In Proceedings of ISCA. Tel-Aviv, Israel, 2013.
[65] David Meisner, Christopher M. Sadler, Luiz André Barroso, Wolf-Dietrich Weber, and Thomas F. Wenisch. 2011. Power management of online data-intensive services. In Proceedings of the 38th annual international symposium on Computer architecture. 319–330.
[66] Ripal Nathuji, Canturk Isci, and Eugene Gorbatov. [n. d.]. Exploiting platform heterogeneity for power efficient data centers. In Proceedings of ICAC. Jacksonville, FL, 2007.
[67] Ripal Nathuji, Aman Kansal, and Alireza Ghaffarkhah. [n. d.]. Q-Clouds: Managing Performance Interference Effects for QoS-Aware


Clouds. In Proceedings of EuroSys. Paris, France, 2010.
[68] Kay Ousterhout, Patrick Wendell, Matei Zaharia, and Ion Stoica. [n. d.]. Sparrow: Distributed, Low Latency Scheduling. In Proceedings of SOSP. Farmington, PA, 2013.
[69] Raghu Prabhakar, David Koeplinger, Kevin J. Brown, HyoukJoong Lee, Christopher De Sa, Christos Kozyrakis, and Kunle Olukotun. 2016. Generating Configurable Hardware from Parallel Patterns. In Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '16, Atlanta, GA, USA, April 2-6, 2016. 651–665.
[70] Raghu Prabhakar, Yaqi Zhang, David Koeplinger, Matthew Feldman, Tian Zhao, Stefan Hadjis, Ardavan Pedram, Christos Kozyrakis, and Kunle Olukotun. 2017. Plasticine: A Reconfigurable Architecture For Parallel Patterns. In Proceedings of the 44th Annual International Symposium on Computer Architecture, ISCA 2017, Toronto, ON, Canada, June 24-28, 2017. 389–402. [Link]
[71] Andrew Putnam, Adrian M. Caulfield, Eric S. Chung, Derek Chiou, Kypros Constantinides, John Demme, Hadi Esmaeilzadeh, Jeremy Fowers, Gopi Prashanth Gopal, Jan Gray, Michael Haselman, Scott Hauck, Stephen Heil, Amir Hormati, Joo-Young Kim, Sitaram Lanka, James Larus, Eric Peterson, Simon Pope, Aaron Smith, Jason Thong, Phillip Yi Xiao, and Doug Burger. 2014. A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services. In Proc. of the 41st Intl. Symp. on Computer Architecture.
[72] Gang Ren, Eric Tune, Tipp Moseley, Yixin Shi, Silvius Rus, and Robert Hundt. 2010. Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers. IEEE Micro (2010), 65–79. [Link]
[73] Malte Schwarzkopf, Andy Konwinski, Michael Abd-El-Malek, and John Wilkes. [n. d.]. Omega: flexible, scalable schedulers for large compute clusters. In Proceedings of EuroSys. Prague, Czech Republic, 2013.
[74] D. Sidler, G. Alonso, M. Blott, K. Karras, Kees Vissers, and Raymond Carley. [n. d.]. Scalable 10Gbps TCP/IP Stack Architecture for Reconfigurable Hardware. In Proceedings of FCCM. 2015.
[75] D. Sidler, Z. Istvan, and G. Alonso. [n. d.]. Low-Latency TCP/IP Stack for Data Center Applications. In Proceedings of FPL. 2016.
[76] Benjamin H. Sigelman, Luiz André Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag. 2010. Dapper, a Large-Scale Distributed Systems Tracing Infrastructure. Technical Report. Google, Inc. [Link]
[77] David Sprott and Lawrence Wilkes. January 2004. Understanding Service-Oriented Architecture, CBDI Forum.
[78] Akshitha Sriraman and Thomas F. Wenisch. 2018. uSuite: A Benchmark Suite for Microservices. In 2018 IEEE International Symposium on Workload Characterization, IISWC 2018, Raleigh, NC, USA, September 30 - October 2, 2018. 1–12.
[79] Takanori Ueda, Takuya Nakaike, and Moriyoshi Ohara. [n. d.]. Workload characterization for microservices. In Proc. of IISWC. 2016.
[80] Abhishek Verma, Luis Pedrosa, Madhukar R. Korupolu, David Oppenheimer, Eric Tune, and John Wilkes. 2015. Large-scale cluster management at Google with Borg. In Proceedings of the European Conference on Computer Systems (EuroSys). Bordeaux, France.
[81] Lei Wang, Jianfeng Zhan, Chunjie Luo, Yuqing Zhu, Qiang Yang, Yongqiang He, Wanling Gao, Zhen Jia, Yingjie Shi, Shujie Zhang, Chen Zheng, Gang Lu, Kent Zhan, Xiaona Li, and Bizhu Qiu. 2014. BigDataBench: A big data benchmark suite from internet services. 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA) (2014), 488–499. [Link]
[82] Qingyang Wang, Chien-An Lai, Yasuhiko Kanemasa, Shungeng Zhang, and Calton Pu. 2017. A Study of Long-Tail Latency in n-Tier Systems: RPC vs. Asynchronous Invocations. In 37th IEEE International Conference on Distributed Computing Systems, ICDCS 2017, Atlanta, GA, USA, June 5-8, 2017. 207–217.
[83] Ian H. Witten, Eibe Frank, and Geoffrey Holmes. [n. d.]. Data Mining: Practical Machine Learning Tools and Techniques. 3rd Edition.
[84] Hailong Yang, Alex Breslow, Jason Mars, and Lingjia Tang. [n. d.]. Bubble-flux: precise online QoS management for increased utilization in warehouse scale computers. In Proceedings of ISCA. 2013.
[85] Hailong Yang, Quan Chen, Moeiz Riaz, Zhongzhi Luan, Lingjia Tang, and Jason Mars. 2017. PowerChief: Intelligent Power Allocation for Multi-Stage Applications to Improve Responsiveness on Power Constrained CMP. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). ACM, New York, NY, USA, 133–146.
[86] Xiang Zhou, Xin Peng, Tao Xie, Jun Sun, Chenjie Xu, Chao Ji, and Wenyun Zhao. 2018. Benchmarking Microservice Systems for Software Engineering Research. In Proceedings of the 40th International Conference on Software Engineering: Companion Proceedings (ICSE '18). ACM, New York, NY, USA, 323–324.
[87] Tao Zhu, Jack Li, Josh Kimball, Junhee Park, Chien-An Lai, Calton Pu, and Qingyang Wang. 2017. Limitations of Load Balancing Mechanisms for N-Tier Systems in the Presence of Millibottlenecks. In 37th IEEE International Conference on Distributed Computing Systems, ICDCS 2017, Atlanta, GA, USA, June 5-8, 2017. 1367–1377.
[88] Yuhao Zhu, Daniel Richins, Matthew Halpern, and Vijay Janapa Reddi. 2015. Microarchitectural Implications of Event-driven Server-side Web Applications. In Proceedings of the 48th International Symposium on Microarchitecture (MICRO-48). ACM, New York, NY, USA, 762–774. [Link]
