Scalable inference in latent variable models

2012, Proceedings of the fifth ACM international conference on Web search and data mining - WSDM '12

Amr Ahmed, Mohamed Aly, Joseph Gonzalez, Shravan Narayanamurthy, Alexander Smola
Yahoo! Research & CMU

Motivations: Data is everywhere
• Essentially infinite amount of data
• Labeling is prohibitively expensive
• Even for supervised problems, unlabeled data abounds
• User-understandable structure for representation purposes
• Solutions are often customized to the problem

[Figure: examples with labels and unlabeled examples]

Three basic inference problems:
[Figure: three basic inference problems]

Challenges
• Millions to billions of instances
• Rich structure of data (ontology, categories, tags)
• Model description typically larger than the memory of a workstation
• Usually clustering or topic models do not solve the problem
• Temporal structure of data
• Side information for variables
• 10k-100k clusters for hierarchical model
• 1M-100M words
• Communication is an issue for large state space

Map-Reduce is not the solution
• Good if only a small number of MapReduce iterations are needed
• Need to request machines at each iteration (time consuming)
• State lost in between maps
• Communication only via file I/O
Not a good fit for many latent variable models, which are iterative in nature and rely on a shared state.

[Figure: iteration with barriers. Panel labels: "Sample Z for users", "Write counts", "Barrier", "Collect counts and sample Ω", "Do nothing", "Barrier", "Read Ω from ..."]

Architecture: LDA
• Distributed (key,value) storage via ICE
• Background asynchronous synchronization
  - single word at a time to avoid deadlocks
  - no need to have joint dictionary
  - uses disk, network, cpu simultaneously

Global State Synchronization
• Start with common state
• Child stores old and new state
• Parent keeps global state
• Transmit differences asynchronously
  - Inverse element for difference
  - Abelian group for commutativity (sum, log-sum, cyclic group, exponential families)

Key distribution and Fault Tolerance
• Dedicated server for variables
• Select server via consistent hashing
• Storage is O(1/k) per machine
• Communication is O(1) per machine
• Fast snapshots O(1/k) per machine (stop sync and dump state per vertex)

• Schedule message pairs
• Communicate with r random machines simultaneously
• Use Luby-Rackoff PRPG for load balancing
• Efficiency guarantee: 4 simultaneous connections are sufficient

Results
• 8 million documents, 1000 topics, {100,200,400} machines, LDA
• Red (symmetric latency bound message passing)
• Blue (asynchronous bandwidth bound message passing & message scheduling)
• 10x faster synchronization time and 10x faster snapshots
• Scheduling improves 10% already on 150 machines

Applications: Temporal Models
[Figure: long term vs. short-term interest over time (Jan, April, July, Oct); example terms: Barcelona, seafood, millage, fast, Gaga, mortgage]
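
The global state synchronization scheme on the poster (each child keeps an old and a new copy of the state, the parent keeps the global state, and only differences are transmitted) can be illustrated with a minimal sketch. It assumes the state is a table of counts under ordinary addition, which is one of the Abelian groups the poster lists. The class and method names (`Parent`, `Child`, `make_delta`, `apply_delta`) are hypothetical and not taken from the authors' system, and the exchange is simulated sequentially rather than asynchronously over a network.

```python
from collections import Counter


class Parent:
    """Holds the global state (hypothetical name, for illustration only)."""

    def __init__(self, initial):
        self.state = Counter(initial)

    def apply_delta(self, delta):
        # Addition is commutative and associative, so deltas from different
        # children can be applied in any order with the same result.
        for key, value in delta.items():
            self.state[key] += value


class Child:
    """Keeps the state as of the last sync (old) and the current state (new)."""

    def __init__(self, initial):
        self.old = Counter(initial)
        self.new = Counter(initial)

    def update(self, key, amount):
        # Local change, e.g. a count adjusted after resampling a variable.
        self.new[key] += amount

    def make_delta(self):
        # The inverse element (subtraction) turns two states into a difference.
        keys = set(self.new) | set(self.old)
        delta = {k: self.new[k] - self.old[k]
                 for k in keys if self.new[k] != self.old[k]}
        self.old = Counter(self.new)  # everything up to now counts as sent
        return delta


def run_example(order):
    """Apply the same two deltas to a fresh parent in the given order."""
    common = {"topic0": 1}
    parent = Parent(common)
    c1, c2 = Child(common), Child(common)
    c1.update("topic0", 2)
    c2.update("topic0", -1)
    c2.update("topic1", 5)
    deltas = {"c1": c1.make_delta(), "c2": c2.make_delta()}
    for name in order:
        parent.apply_delta(deltas[name])
    return parent.state
```

Calling `run_example(["c1", "c2"])` and `run_example(["c2", "c1"])` yields the same global counts, which is the property that commutativity provides: the parent does not need to order incoming messages.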
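
The poster states that the server responsible for a variable is selected via consistent hashing. The following is a generic sketch of that technique, not the authors' implementation: the `ConsistentHashRing` class, the use of MD5 as the hash, and the `replicas` parameter (virtual points per server) are all assumptions made for illustration.

```python
import bisect
import hashlib


def _hash(text):
    # Any well-mixed hash works here; MD5 is an arbitrary choice for the sketch.
    return int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16)


class ConsistentHashRing:
    """Maps keys to servers so that removing a server moves only its own keys."""

    def __init__(self, servers, replicas=64):
        # Each server is placed at several pseudo-random points on the ring
        # to even out the share of keys it receives.
        self._ring = sorted(
            (_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    def server_for(self, key):
        # Walk clockwise to the first ring point at or after the key's hash,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

With k servers, each one ends up storing roughly a 1/k share of the keys. If a server is removed, keys owned by the remaining servers keep their assignment, so only the lost server's share has to be redistributed.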