# CS 194/294-280 (Advanced LLM Agents) - Lecture 2, Jason Weston

## Self-Improving Language Models: A Journey Through Recent Advancements

This blog post summarizes a talk on self-improving language models, focusing on how they are getting better at reasoning and the techniques driving this progress. We'll explore the evolution of language models, the limitations of current systems, and the innovative approaches being developed to overcome these limitations.

### Early Days of Language Modeling (Pre-2020)

- **1950s:** Language modeling's roots trace back to Claude Shannon's work, focusing on predicting the next token in a sequence.
- **2003:** Bengio et al. introduced one of the first neural network approaches, using word embeddings and a softmax layer for prediction. Their conclusion highlighted the need for architectural improvements and increased capacity without impacting training time.
- **2000s:** Support Vector Machines (SVMs) dominated NLP research, hindering the advancement of neural network-based language models.
- **2008:** A paper argued for end-to-end neural network training for NLP tasks, a precursor to modern approaches.
- **2013:** A unified architecture for NLP was proposed, using word embeddings, convolutional layers, and a softmax layer, demonstrating pre-training on Wikipedia and fine-tuning on downstream tasks. This approach foreshadowed the success of modern large language models.

### The Rise of Transformers and Reasoning

- **2014:** The attention mechanism emerged, initially for machine translation, enabling alignment between words in different languages.
- **2015:** The bAbI tasks were introduced to assess simple reasoning capabilities, revealing limitations in LSTMs.
- **2017:** The Transformer architecture revolutionized NLP, improving upon the attention mechanism with multi-head attention and normalization.
- **2018:** BERT demonstrated the effectiveness of masked language models with Transformers. The scaling hypothesis emerged, suggesting that larger models trained on more data would lead to better performance.
- **Post-2018:** A rapid increase in the number of papers using neural networks in NLP reflects the paradigm shift towards this approach.

### System 1 vs. System 2 Reasoning

The talk introduces two types of reasoning:

| Reasoning Type | Characteristics | Implementation in LLMs | Limitations |
|---|---|---|---|
| System 1 | Reactive, relies on associations | Transformer network, direct output | Spurious correlations, hallucinations, jailbreaking |
| System 2 | Deliberate, effortful | Chain of Thought (CoT), multi-step reasoning | Requires more computational resources |

### Improving Reasoning: Prompting and Optimization

**Prompting Approaches (2022-2023):** Chain of Thought (CoT) prompting, including few-shot prompting and "Let's think step-by-step" instructions, significantly improved performance on reasoning tasks. This approach addresses System 2 reasoning by generating intermediate steps before the final answer.

**Chain of Verification:** This approach uses CoT to verify the initial response, improving factuality and reducing hallucinations. It involves generating verification questions and using the model's answers to identify and correct errors, as sketched below.
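To make the verify-then-revise loop concrete, here is a minimal sketch. The `query_llm` helper is a hypothetical stand-in for any chat-completion API call, and the prompts are illustrative, not the exact ones from the Chain-of-Verification paper:

```python
# Minimal Chain-of-Verification sketch. `query_llm` is a hypothetical
# stand-in for any chat-completion API call; prompts are illustrative.
def query_llm(prompt: str) -> str:
    raise NotImplementedError("wire this to your LLM API of choice")

def chain_of_verification(question: str) -> str:
    # 1. Draft an initial answer (fast, System 1 style, may hallucinate).
    draft = query_llm(f"Answer the question.\nQ: {question}\nA:")

    # 2. Plan verification questions that fact-check the draft.
    plan = query_llm(
        "List short fact-checking questions, one per line, that would "
        f"verify this answer.\nQ: {question}\nDraft: {draft}\nQuestions:"
    )

    # 3. Answer each verification question independently of the draft,
    #    so errors in the draft do not leak into the checks.
    checks = [
        (q, query_llm(f"Answer concisely.\nQ: {q}\nA:"))
        for q in plan.splitlines()
        if q.strip()
    ]

    # 4. Revise the draft in light of the verification results.
    evidence = "\n".join(f"- {q} -> {a}" for q, a in checks)
    return query_llm(
        f"Question: {question}\nDraft answer: {draft}\n"
        f"Verification results:\n{evidence}\n"
        "Write a corrected final answer:"
    )
```

Answering each verification question without showing the draft is the key design choice: it prevents the model from simply rubber-stamping its own earlier mistakes.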
**System 2 Attention:** This technique uses CoT to rewrite instructions, removing irrelevant parts and bias, leading to more accurate and unbiased responses.

### Self-Improving Models: The Next Wave

**Self-Rewarding Language Models:** These models assign rewards to their own outputs, optimizing themselves iteratively. This approach aims to overcome the limitations of human-in-the-loop evaluation, which becomes increasingly challenging as models improve.

**LLM as a Judge:** Language models are used to evaluate the quality of responses, replacing human feedback. This is crucial for continued model improvement beyond the capabilities of average human evaluators.

**Iterative Reasoning Preference Optimization (IRPO):** This method combines CoT generation with verifiable rewards (e.g., matching the final answer to a known correct answer for math problems) to improve reasoning abilities iteratively. The iterative process is key, as it allows the model to generate better CoTs with each iteration (a sketch of the loop closes this post).

**DeepSeek R1:** This model demonstrates the effectiveness of verifiable rewards and iterative training for achieving high performance on complex reasoning tasks, similar to OpenAI's o1. The model learns to generate longer, more sophisticated CoTs over time.

### Beyond Verifiable Rewards: Thinking LLMs

**Thought Preference Optimization (TPO):** This extends IRPO to non-verifiable tasks by using LLMs as judges to evaluate the quality of CoTs. Initial iterations might show performance degradation due to the model's existing fine-tuning, but subsequent iterations lead to improvements.

**Meta-Rewarding Language Models:** This approach introduces a "meta-judge" LLM to evaluate the judgments of the main LLM, further improving the quality of evaluations and leading to better model performance.

**Thinking LLMs as Judges:** This method focuses on generating long, detailed CoTs for evaluation tasks, using synthetic verifiable data to train the model. The use of evaluation plans and unconstrained CoTs is crucial for optimal performance.

### Future Directions

- **Beyond Text-Based CoT:** Exploring the use of continuous vectors instead of text for CoT reasoning.
- **Improved System 1 Reasoning:** Developing better Transformer architectures and attention mechanisms.
- **Agent-Based Reasoning:** Training models to reason through interaction with the world, the internet, or themselves.
- **Self-Awareness:** Enabling models to understand their own knowledge and limitations.

### Key Takeaways

This talk highlights the rapid advancements in self-improving language models. The shift from human-in-the-loop evaluation to self-evaluation using LLMs as judges is a key driver of progress. Techniques like verifiable rewards and iterative training are crucial for improving both reasoning abilities and the models' capacity for self-evaluation. Future research will likely focus on integrating these advancements, exploring alternative CoT methods, and developing more sophisticated self-evaluation mechanisms to push the boundaries of AI capabilities.
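As promised, here is a minimal sketch of the IRPO loop described above. `sample_cot` (drawing a chain of thought plus final answer from the current model) and `dpo_update` (one preference-optimization pass) are hypothetical helpers; a real implementation would also handle sampling temperature, deduplication, and pair balancing:

```python
# Minimal IRPO sketch: sample CoTs, score them with a verifiable reward
# (exact match against the known answer), build preference pairs, optimize,
# and repeat with the improved model.
def sample_cot(model, question):
    """Hypothetical helper: sample one chain of thought plus final answer."""
    raise NotImplementedError

def dpo_update(model, pairs):
    """Hypothetical helper: one DPO pass over (chosen, rejected) pairs."""
    raise NotImplementedError

def irpo(model, train_set, n_samples=8, n_iterations=3):
    for _ in range(n_iterations):
        pairs = []
        for question, gold_answer in train_set:
            attempts = [sample_cot(model, question) for _ in range(n_samples)]
            # Verifiable reward: 1 if the final answer matches, else 0.
            good = [a for a in attempts if a.final_answer == gold_answer]
            bad = [a for a in attempts if a.final_answer != gold_answer]
            # Chosen/rejected pairs contrast correct vs. incorrect CoTs.
            pairs.extend(zip(good, bad))
        # Train on the pairs, then iterate: the improved model generates
        # better CoTs, which yield better pairs on the next pass.
        model = dpo_update(model, pairs)
    return model
```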
# CS 194/294-280 (Advanced LLM Agents) - Lecture 2, Jason Weston

## Self-Improving Language Models: An In-Depth Analysis

This analysis delves into the advancements in self-improving language models (LLMs), focusing on how they enhance reasoning capabilities and achieve near-human or even superhuman performance. The discussion covers historical context, current methods, and future directions.

### I. Historical Context and the Rise of Neural Networks in NLP

- **Early Language Modeling:** The concept of language modeling, predicting the next word in a sequence, dates back to Claude Shannon's work in the 1950s. Early implementations used statistical methods.
- **Neural Network Approaches (2000s):** The 2003 work by Bengio et al. introduced neural networks for language modeling, utilizing *word embeddings* and a softmax layer for prediction. Despite initial success, computational limitations and the prevalence of Support Vector Machines (SVMs) hampered progress.
- **Convolutional Neural Networks (2008):** The speaker's 2008 work proposed a unified architecture for NLP using convolutional neural networks, demonstrating end-to-end training on large corpora like Wikipedia. This was a significant step towards modern LLMs.
- **Attention Mechanism and Transformers (2014-2017):** The introduction of the *attention mechanism* in 2014 revolutionized machine translation. The Transformer architecture (2017), building upon attention, became the dominant architecture for LLMs.
- **BERT and Masked Language Modeling (2018):** BERT demonstrated the effectiveness of *masked language modeling* with Transformers, further solidifying the current paradigm.
- **The Scaling Hypothesis:** The success of increasingly large models (GPT series) validated the *scaling hypothesis*, suggesting that larger models trained on more data yield superior performance.

### II. System 1 vs. System 2 Reasoning in LLMs

The speaker distinguishes between two types of reasoning in LLMs:

| Reasoning Type | Characteristics | LLM Implementation | Limitations |
|---|---|---|---|
| System 1 | Reactive, relies on associations, fast, effortless | Transformer architecture's direct processing of input tokens; fixed compute per token | Spurious correlations, hallucinations, jailbreaking |
| System 2 | Deliberate, effortful, slow, requires planning and verification | Chain of Thought (CoT) prompting; generating intermediate steps before final output | Requires more computational resources |

- **System 1:** This is the model's initial, intuitive response, analogous to human intuition. It processes input tokens directly through the neural network. However, it is prone to errors like hallucinations and spurious correlations.
- **System 2:** This involves more deliberate reasoning, similar to conscious thought in humans. It often utilizes CoT prompting, where the model generates intermediate reasoning steps before arriving at a final answer. This approach mitigates System 1's limitations but is computationally more expensive.

### III. Improving Reasoning through Optimization and Self-Improvement

The core focus shifts to enhancing reasoning through optimization and self-improvement techniques:

#### A. Prompting Approaches (2022-2023)

- **Few-shot prompting:** Providing examples of desired CoT reasoning in the prompt to guide the model's output, for example:
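A minimal illustration of such a prompt (the worked example is adapted from the chain-of-thought literature, not taken from this lecture):

```python
# A few-shot CoT prompt: a single worked example demonstrates the
# step-by-step format the model should imitate before answering.
FEW_SHOT_COT_PROMPT = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is
6 tennis balls. 5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and
bought 6 more, how many apples do they have?
A:"""
```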
- "Let's think step by step" prompting: A simp ler appro ach, instru cting th e mod el to gener ate Co T reaso ning explic itly. - Chain of Verification: Usi ng Co T to veri fy th e mode l's init ial respo nse by genera ting an d answe ring verific ation questi ons. Thi s addre sses factua lity an d halluci nation issu es. - System 2 Attention: Rewri ting instruc tions to remo ve irrele vant inform ation an d bia s befo re th e mod el gener ates a respo nse. - Branch, Solve, Merge: Break ing dow n th e evalua tion tas k int o sub-cri teria (relev ance, clari ty, accur acy, et c. ) fo r mor e compreh ensive judgm ent. - B. Self-Rewarding Language Models Th e limita tions of human-in- the-loop evalua tion fo r increas ingly sophist icated LLM s motiv ate self-rew arding mode ls. LLM as a Judge: Train ing th e LL M to evalu ate it s ow n respon ses, repla cing hum an feedb ack. - Iterative Data Creation and Curation: A closed -loop syst em whe re th e mod el gener ates ne w task s, respon ses, an d rewar ds, iterat ively impro ving it s instruction -following an d evalua tion abilit ies. - Self-Instruct: A techn ique fo r genera ting ne w tas ks usi ng few-s hot prompt ing. - Direct Preference Optimization (DPO): Direc tly optimi zing th e mode l's probabi lities bas ed on prefe rred an d rejec ted respo nse pair s. - Iterative Reasoning Preference Optimization (IRPO): Genera ting Co T reaso ning an d usi ng verifi able rewa rds (e . g. , compa ring th e mode l's answ er to th e kno wn corr ect answ er fo r mat h probl ems) fo r optimiz ation. - C. DeepSeek R1 and the Emergence of Advanced Self-Training Methods DeepS eek R1 , a mod el achie ving perfor mance compar able to OpenA I's GPT- 4, emplo yed a self-tr aining appro ach simi lar to IRP O, utili zing verifi able rewa rds an d itera tive refine ment of Co T reason ing. - D. Thought Preference Optimization (TPO) Thi s meth od exte nds IRP O to non-veri fiable task s, usi ng an LL M as a jud ge to evalu ate Co T respon ses. It demonst rates th e poten tial fo r self-impr ovement acro ss dive rse tas ks beyo nd jus t mat h probl ems. - E. Meta-Rewarding Language Models Thi s appro ach furt her enhan ces self-eva luation by introd ucing a "meta-j udge" – th e LL M evalua ting it s ow n judgme nts. Thi s crea tes a virtu ous cyc le of improv ement in bot h instru ction follo wing an d evalua tion. - F. Thinking LLMs as Judges Thi s focu ses on genera ting detai led Co T reaso ning fo r evalua tion task s, levera ging verifi able rewa rd sign als fro m synth etic dat a to impr ove judgm ent quali ty. - IV. Future Directions Agents: LLM s intera cting wit h th e enviro nment (inter net, real-w orld simulat ions). Synthetic Data: Genera ting train ing dat a to augm ent exist ing datas ets. Inference Time Compute: Utili zing mor e computa tional resou rces duri ng infer ence fo r impro ved reaso ning. Reasoning Understanding: Impro ving th e mode l's underst anding of it s ow n reaso ning proce sses. Self-Awareness: Develo ping mode ls wit h a grea ter underst anding of the ir knowl edge an d limita tions. Improving System 1 Reasoning: Explo ring nov el neur al netw ork archite ctures an d atten tion mechan isms. Beyond Textual CoT: Investi gating non-te xtual represen tations fo r Co T reaso ning. Key Takeaways Self-imp roving LLM s ar e rapi dly advanc ing, push ing th e bounda ries of reaso ning capabil ities. Syst em 1 an d Syst em 2 reaso ning repre sent disti nct approa ches wit h complem entary stren gths an d weakne sses. 
# CS 194/294-280 (Advanced LLM Agents) - Lecture 2, Jason Weston

## A Study Guide on Self-Improving Language Models

This study guide summarizes key concepts from a YouTube video transcript on self-improving language models.

### Introduction

- **AI's rapid evolution:** The field of AI, particularly language models, is rapidly advancing, with significant improvements seen in recent years. This progress allows for applications like art creation, code writing, and complex reasoning.
- **Self-improving language models:** The focus is on methods for language models to improve themselves, primarily by enhancing their reasoning capabilities. This involves introspection during training, allowing the model to learn from its own mistakes and successes.
- **System 1 and System 2 reasoning:** Two distinct approaches to improving reasoning are discussed:
  * **System 1:** *Reactive, relying on associations. Similar to a Transformer's neural network processing.* This has limitations, including spurious correlations, hallucinations, and jailbreaking.
  * **System 2:** *Deliberate and effortful thinking. Often achieved through Chain of Thought prompting.* This allows for planning, search, verification, and other complex reasoning tasks.

### A Brief History of Language Modeling

- **Pre-2020 (Prehistory):** Language modeling's origins trace back to Claude Shannon's work in the 1950s. The core concept is predicting the next token in a sequence.
- **2003:** Bengio et al. introduced a neural network approach to language modeling using word embeddings. This early model highlighted the need for improved architecture and computational efficiency.
- **2000s:** Research shifted away from neural networks to support vector machines (SVMs).
- **2008:** A paper argued for end-to-end neural network training for NLP tasks, foreshadowing current approaches.
- **2013:** A unified architecture for NLP using convolutional layers and a proto-attention mechanism was proposed. This model could be pre-trained on large datasets like Wikipedia and fine-tuned for specific tasks.
- **2014:** The attention mechanism was introduced, crucial for the Transformer architecture. This allowed for alignment between words in different parts of a sentence or between sentences.
- **2017:** The Transformer architecture emerged, forming the basis of many current LLMs.
- **2018:** BERT demonstrated the effectiveness of masked language models with Transformers. The scaling hypothesis emerged, suggesting that larger models trained on more data would lead to better performance.
- **2019-present:** Focus shifted to exploring training methods beyond the language model objective function, incorporating reinforcement learning and human feedback.

### Improving Reasoning Through Different Methods

- **2019:** Self-feeding chatbot, using a reward model and human feedback for training.
- **2020:** BlenderBot, a pre-trained LLM fine-tuned on human-annotated dialogue data.
- **2022:** InstructGPT, advocating reinforcement learning from human feedback (RLHF).
  * This involved collecting demonstration data and comparison data from humans to train a reward model.
- **Direct Preference Optimization (DPO):** A simpler alternative to RLHF, directly optimizing the probability of preferred responses (a minimal sketch of the loss follows).
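To make the contrast with RLHF concrete, here is a minimal sketch of the DPO objective. It assumes the summed per-response log-probabilities have already been computed under the policy and a frozen reference model; `beta` is the usual hyperparameter controlling deviation from the reference:

```python
import torch
import torch.nn.functional as F

# Minimal DPO loss sketch. Each argument is a tensor of summed token
# log-probabilities for a batch of (chosen, rejected) response pairs.
def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Maximize the margin between chosen and rejected log-ratios,
    # i.e., minimize the negative log-sigmoid of that margin.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```

The reference model keeps the policy from drifting arbitrarily far while it learns to prefer chosen over rejected responses, playing the role the KL penalty plays in RLHF.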
- **Chain of Thought (CoT) Prompting:** A prompting technique that encourages the model to generate intermediate reasoning steps (a "chain of thought") before providing a final answer. This significantly improves performance on tasks requiring reasoning, especially math problems.
- **Chain of Verification:** A CoT approach to address factuality and hallucination issues. The model generates a draft response and then asks itself verification questions to check its accuracy.
- **System 2 Attention:** A CoT method to mitigate semantic leakage and priming effects. The model rewrites the original instruction to remove irrelevant or biased parts.
- **Branch, Solve, Merge:** A CoT approach for response evaluation. The model breaks down the evaluation into different criteria (relevance, clarity, etc.) before merging them into a final judgment.

### Self-Improving Models

- **Limitations of RLHF:** As LLMs improve, it becomes increasingly difficult and expensive for humans to provide accurate feedback, especially for complex tasks.
- **LLM as a Judge:** Using LLMs to evaluate the quality of model-generated responses. This allows for automation of the evaluation process (a minimal sketch appears before the Key Takeaways below).
- **Self-Rewarding Language Models:** Models that train themselves by assigning rewards to their own outputs. This involves an iterative process of generating tasks, responses, rewards, and preference pairs.
- **Iterative Reasoning Preference Optimization (IRPO):** A self-improving method that incorporates Chain of Thought reasoning and verifiable rewards (for tasks where the correct answer is known).
  * This is particularly effective for math problems.
- **DeepSeek-R1:** An independently developed model that achieves results similar to OpenAI's o1, using a similar iterative training approach with verifiable rewards.
- **Thought Preference Optimization (TPO):** Extends IRPO to non-verifiable tasks, using LLMs as judges to evaluate Chain of Thought responses. This method shows that Chain of Thought training can be effective for various tasks beyond math problems.

### Advanced Techniques and Future Directions

- **Meta-Rewarding Language Models:** Models that improve their own judgment abilities by meta-judging their own judgments. This creates a virtuous cycle of improvement in both instruction following and evaluation.
- **Thinking LLMs as a Judge:** Focuses on generating long chains of thought for evaluation tasks, using verifiable rewards where possible.
- **Coconut:** Replaces textual Chain of Thought with vector-based reasoning. This offers a potential alternative to textual CoT, with potential advantages in certain tasks.

**Future Research Directions:** The video concludes by highlighting several promising avenues for future research, including:

- **Agents:** Developing LLMs that can interact with the world and learn from their experiences.
- **Synthetic Data:** Using synthetic data to augment training data and improve model performance.
- **Inference Time Compute:** Leveraging computational resources during inference to enhance reasoning and evaluation.
- **Self-Awareness:** Exploring ways to make LLMs more aware of their own knowledge and limitations.
- **Improved System 1 Reasoning:** Developing better underlying architectures for LLMs.
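As referenced above, a minimal LLM-as-a-judge sketch. The rubric wording and the `query_llm` helper are illustrative assumptions, not the exact prompt from the self-rewarding work; the parsed score can then rank candidate responses into preference pairs for DPO-style training:

```python
import re

# Minimal LLM-as-a-judge sketch. `query_llm` is a hypothetical
# chat-completion helper; the rubric is illustrative.
def query_llm(prompt: str) -> str:
    raise NotImplementedError("wire this to your LLM API of choice")

def judge(instruction: str, response: str) -> int:
    verdict = query_llm(
        "Evaluate the response for helpfulness, correctness, and clarity. "
        "Explain your reasoning, then end with a line 'Score: N' where N "
        "is an integer from 1 to 10.\n"
        f"Instruction: {instruction}\nResponse: {response}"
    )
    # Parse the trailing 'Score: N' line emitted by the judge.
    match = re.search(r"Score:\s*(\d+)\s*$", verdict)
    return int(match.group(1)) if match else 0
```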
### Key Takeaways

Self-improving language models are a promising area of research, with the potential to lead to significant advancements in AI. Chain of Thought prompting and other System 2 reasoning techniques are crucial for enhancing the reasoning capabilities of LLMs. Verifiable rewards and LLMs as judges are effective methods for training and evaluating self-improving models. Future research will likely focus on integrating various self-improvement techniques and exploring new architectures and training methods.