BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:Europe/Stockholm
X-LIC-LOCATION:Europe/Stockholm
BEGIN:DAYLIGHT
TZOFFSETFROM:+0100
TZOFFSETTO:+0200
TZNAME:CEST
DTSTART:19700329T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=-1SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:+0200
TZOFFSETTO:+0100
TZNAME:CET
DTSTART:19701025T030000
RRULE:FREQ=YEARLY;BYMONTH=10;BYDAY=-1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20250822T115807Z
LOCATION:Room 6.0D13
DTSTART;TZID=Europe/Stockholm:20250617T163000
DTEND;TZID=Europe/Stockholm:20250617T170000
UID:submissions.pasc-conference.org_PASC25_sess107_msa174@linklings.com
SUMMARY:Speeding Up LLM Inference via Sequential Speculative Decoding
DESCRIPTION:Meiyu Zhong, Noel Teku, and Ravi Tandon (The University of Ari
 zona)\n\nAs Large Language Models (LLMs) grow in size and capability, thei
 r high computational cost poses a major challenge for real-time applicatio
 ns, making efficient inference a critical research problem. Speculative De
 coding (SD) has emerged as a promising technique to accelerate LLM inferen
 ce by leveraging a smaller draft model to generate candidate tokens, which
  are then verified in parallel by a larger target model to ensure statisti
 cal consistency. However, the need for frequent verification calls to the 
 target LLM limits the potential speedup of SD. We propose SPRINTER, which 
 utilizes a low-complexity verifier trained to predict if tokens generated 
 by the draft model would be accepted by the target LLM. By performing appr
 oximate sequential verification, SPRINTER eliminates the need for constant
  verification: the target LLM is invoked only when a token is deemed
  unacceptable. This significantly reduces the number of calls to the large
 r model, enabling further acceleration. We present a theoretical analysis 
 of SPRINTER, examining the statistical properties of the generated tokens 
 and the expected reduction in latency as a function of the verifier. Our e
 valuations on multiple datasets and model pairs demonstrate that approxima
 te verification can maintain high-quality generation while achieving even 
 greater speedups.\n\nDomain: Applied Social Sciences and Humanities, Life 
 Sciences, Computational Methods and Applied Mathematics\n\nSession Chair: 
 Adam Spannaus (Oak Ridge National Laboratory)\n\n
END:VEVENT
END:VCALENDAR
