0% found this document useful (0 votes)
107 views

MapReduce Questions and Answers Part 1 - Java Code Geeks

MapReduce Questions and Answers

Uploaded by

aepuri
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
107 views

MapReduce Questions and Answers Part 1 - Java Code Geeks

MapReduce Questions and Answers

Uploaded by

aepuri
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 8

Home Tutorials JoinUs About Resources JobBoard Examples Whitepapers Academy Library

Java Android JVMLanguages SoftwareDevelopment Agile DevOps Communications Career Misc MetaJCG Searchthissite...

Youarehere:Home SoftwareDevelopment MapReduceQuestionsandAnswersPart1

AboutMariaJurcovicova

Newsletter

MapReduceQuestionsandAnswersPart1 80381insidersarealreadyenjoyin
updatesandcomplimentarywhitepa
byMariaJurcovicovaonMay17th,2012 | Filedin:SoftwareDevelopment Tags:MapReduce

WithallthehypeandbuzzsurroundingNoSQL,Idecidedtohavealookatit.IquicklyfoundthatthereisnotoneNoSQLIcouldlearn.
Jointhemnowtogain
Rather,therearevariousdifferentsolutionswithdifferentpurposesandtradeoffs.Thosevarioussolutionstendtohaveonethingin tothelatestnewsintheJavaworld,
common:processingofdatainNoSQLstorageisusuallydoneusingMapReduceframework. insightsaboutAndroid,Scala,Groov
relatedtechnologies.
SearchonMapReducefoundvariousscatteredblogposts,someuniversitiescoursespagesandonebookthatseemstocontainalmost
everythingothersourcesdid.
Emailaddress:

Youremailaddress
ThispostcontainsMapReducequestionsandanswersbasedonthebook.Basically,ifIwouldbeastudent,thisiswhatIwouldhave
madeasatestpreparationnotes.IfIwouldbeateacher,thisiswhatIwouldaskontheexam.
Signup

Firstchaptergivescreditwherethecreditisdue,therestcontainsquestions.Lastchaptercontainshandsoncodingexercises.
JoinUs
TheBook
With
ThebookisnamedDataIntensiveTextProcessingwithMapReduce.Ifyouareunsurewhetheryouwanttobuyitornot,pre uniquevisitors
productionmanuscriptisavailableforfree. authorswear
thetopJavar
Donotbefooledbythetitle.ThebookismoreaboutMapReduceitselfandlessabouttextprocessing.Firsthalfofthebookdescribes around.Cons
generalpurposetricks(designpatterns)usefulforanytask.Secondhalfcontainsachaptersontextprocessing,graphalgorithmsand thelookoutfo
expectationmaximization. encourageyo
Ifyouhaveablogwithuniqueandin
ThebookcontainsalmosteverythingIfoundonvariousblogs,universitycoursespagesandmuchmore. contentthenyoushouldcheckouto
partnersprogram.Youcanalsobe
Questions forJavaCodeGeeksandhoneyou

Questionsaresplitbybookchapters.Withoneminorexception(counters),questionsaremostlybasedonthefirsthalfofthebook.
SecondhalfcontainsconcretealgorithmsandIwantedtofocusongeneralpurposeknowledge.

Itdoesnotmeanthatlearningthemisnotuseful.EspeciallyGraphAlgorithmschaptercontainseasilygeneralizableideas.

2MapReduceBasics

2.2MappersandReducers

DescribegeneralMapReducealgorithm.Splititintophases.Foreachphaseinclude:

whoisresponsible(framework/programmer/customizable),
whatitdoes,
phaseinput,
phaseoutput.

Theansweris:

MapReducehasfourphases:

map,
combine,
shuttleandsort,
reduce.

Mapphaseisdonebymappers.Mappersrunonunsortedinputkey/valuespairs.Thesamephysicalnodesthatkeepsinputdatarun
alsomappers.Eachmapperemitszero,oneormultipleoutputkey/valuepairsforeachinputkey/valuepair.Outputkey/valuepairsare
calledintermediatekey/valuepairs.Theirtypeisusuallydifferentfrominputkey/valuepairtype.Mappermustbesuppliedby
programmers.
Combinephaseisdonebycombiners.Combinershouldcombinekey/valuepairswiththesamekeytogether.Eachcombinermayrun
zero,onceormultipletimes.Frameworkdecideswhetherandhowmanytimestorunthecombiner,programmerhasnocontroloverit.
Combinersoutputkey/valuepairtypemustbethesameasitsinputkey/valuepairtypes.

Shuttleandsortphaseisdonebyframework.Datafromallmappersaregroupedbythekey,splitamongreducersandsortedbythe
key.Eachreducerobtainsallvaluesassociatedwiththesamekey.Programmermaysupplycustomcomparefunctionforsortingand
partitionerfordatasplit.Allkey/valuepairsgoingtothesamereduceraresortedbythekey,butthereisnoglobalsorting.

Reducerobtainssortedkey/[valueslist]pairssortedbythekey.Valueslistcontainsallvalueswiththesamekeyproducedbymappers.
Eachreduceremitszero,oneormultipleoutputkey/valuepairsforeachinputkey/valuepair.Outputkey/valuepairtypeisusually
differentfrominputkey/valuepairtype.Reducermustbesuppliedbyprogrammers.

IfthealgorithmrequiresmultipleMapReduceiterations,eachcombinermayincrementglobalcounter.Driverprogramwouldreadthe
counterafterthereducephase.Itthendecideswhethernextiterationisneededornot.

Note:chapter2doesnotmentioncounters.Theyareexplainedlater,inthechapter5.Decideifthestatementistrueorfalse:All
MapReduceimplementationsimplementexactlysamealgorithm.

Decideifthestatementistrueorfalse:AllMapReduseimplementationsimplementexactlysamealgorithm.
Theansweris:

False.Forexample,Googlesimplementationdoesnotallowchangeofkeyinthereducer,butprovidessortingforvalues.Hadoopdoes
notprovidevaluessorting,butreducercanchangethekey.

Trueorfalse:Eachmappermustgeneratethesamenumberofkey/valuepairsasitsinputhad.

Theansweris:

False.Mappermaygenerateanynumberofkey/valuepairs(includingzero).

Trueorfalse:Mappersinputkey/valuepairsaresortedbythekey.

Theansweris:

False.Mappersinputisnotsortedinanyway.

Trueorfalse:Mappersoutputkey/valuemustbeofthesametypeasitsinput.

Theansweris:

False.Mappermayproducekey/valuepairsofanytype.

Trueorfalse:Reducerisappliedtoallvaluesassociatedwiththesamekey.

Theansweris:

True.Reducerisappliedtoallvaluesassociatedwiththesamekey.

Trueorfalse:Reducersinputkey/valuepairsaresortedbythekey.

Theansweris:

True.Reducersinputkey/valuepairsaresortedbythekey.
implementation.

Trueorfalse:Eachreducermustgeneratethesamenumberofkey/valuepairsasitsinputhad.

Theansweris:

False.Reducermaygenerateanynumberofkey/valuepairs(includingzero).

Trueorfalse:Reducersoutputkey/valuepairmustbeofthesametypeasitsinput.

Theansweris:

False.ThestatementisfalseinHadoopandtrueinGooglesimplementation.

2.3TheExecutionFramework

Whathappensincaseofhardware/softwarefailure?

Theansweris:

MapReduceframeworkmustbeabletorecoverfrombothhardware(diskfailures,RAMerrors)andsoftware(bugs,unexpected
exceptions)errors.Botharecommonandexpected.

Isitpossibletostartreducerswhilesomemappersstillrun?Why?
Theansweris:

No.Reducersinputisgroupedbythekey.Thelastmappercouldtheoreticallyproducekeyalreadyconsumedbyrunningreducer.

Defineastraggler.

Theansweris:

Straggleriseithermaporreducetaskthattakesunusuallylongtimetocomplete.

Whatisspeculativeexecution(alsocalledbackuptasks)?Whatproblemdoesitsolve?

Theansweris:

Identicalcopyofthesametaskisexecutedonmultiplenodes.Outputofthefastesttaskused.
Speculativeexecutionhelpsifthetaskisslowbecauseofhardwareproblem.Itdoesnothelpifthedistributionofvaluesoverkeysis
skewed.

2.4PartitionersandCombiners

Whatdoespartitionerdo?

Theansweris:

Partitionerdivideskey/valuespairsproducedbymaptasksbetweenreducers.

Whatdoescombinerdo?

Theansweris:

Combinerdoeslocalaggregationofkey/valuespairsproducedbymapperbeforeorduringshuttleandsortphase.Ingeneral,it
reducesamountofdatatobetransferredbetweennodes.

Theframeworkdecideshowmanytimestorunit.Combinermayrunzero,oneormultipletimesonthesameinput.

Decideifthestatementistrueorfalse:Eachcombinerrunsexactlyonce.

Theansweris:

False.Theframeworkdecideswhethercombinerrunszero,onceormultipletimes.

2.5TheDistributedFileSystem

BrieflydescribeHDFSarchitecture.

Theansweris:

HDFShasonenamenodeandalotofdatanodes.Namenodeismasterandcoordinatesfileoperations,ensuresintegrityofthesystem
andkeepsnamespace(metadata,directorystructure,filetoblockmappingetc.).

Dataarestoredinbigblocksondatanodes.Eachblockisstoredonmultiple,bydefaultthree,datanodes.Namenodecheckswhether
datanodesworkcorrectlyandmanagesdatareplication.

Clientcontactsnamenodewhichanswerswithdatablockidanddatanodeid.Datanodethensendsdatadirectlytotheclient.

Decideifthestatementistrueorfalse:HDFSisgoodatmanaginglargenumberofsmallfiles.

Theansweris:

False.HDFSisgoodatmanaginglargefiles.

2.6HadoopClusterArchitecture

Explaindifferencebetweenjobtrackerandtasktracker?

Theansweris:

Clientexecutesjobsonjobtracker.Jobtrackerrunsonthemaster.JobtrackermonitorsMapReducejobs.Italsocoordinatesmappers
andreducers.

Tasktrackerrunsbothusercodeanddatanodedaemononslavenodes.Itisnevercontactedbytheclient.

Explainmapperlifecycle.

Theansweris:

Initializationmethodiscalledbeforeanyothermethodiscalled.Ithasnoparametersandnooutput.

Mapmethodiscalledseparatelyforeachkey/valuepair.Itprocessinputkey/valuepairsandemitsintermediatekey/valuepairs.
Closemethodrunsafterallinputkey/valuehavebeenprocessed.Themethodshouldcloseallopenresources.Itmayalsoemit
key/valuepairs.

Explainreducerlifecycle.

Theansweris:

Initializationmethodiscalledbeforeanyothermethodiscalled.Ithasnoparametersandnooutput.

Reducemethodiscalledseparatelyforeachkey/[valueslist]pair.Itprocessintermediatekey/valuepairsandemitsfinalkey/value
pairs.Itsinputisakeyanditeratoroverallintermediatevaluesassociatedwiththesamekey.

Closemethodrunsafterallinputkey/valuehavebeenprocessed.Themethodshouldcloseallopenresources.Itmayalsoemit
key/valuepairs.

3MapReduceAlgorithmDesign

3.1LocalAggregation

Whatislocalaggregationandwhyisitused?

Theansweris:

Eithercombineroramappercombineskey/valuepairswiththesamekeytogether.Theymaydoalsosomeadditionalpreprocessingof
combinedvalues.Onlykey/valuepairsproducedbythesamemapperarecombined.

Key/Valuepairscreatedbymaptasksaretransferredbetweennodesduringshuffleandsortphase.Localaggregationreducesamount
ofdatatobetransferred.

Ifthedistributionofvaluesoverkeysisskewed,datapreprocessingincombinerhelpstoeliminatereducestragglers.

Whatisinmappercombining?Stateadvantagesanddisadvantagesoverwritingcustomcombiner.

Theansweris:

Localaggregation(combiningofkey/valuepairs)doneinsidethemapper.

Mapmethoddoesnotemitkey/valuepairs,itonlyupdatesinternaldatastructure.Closemethodcombinesandpreprocessallstored
dataandemitsfinalkey/valuepairs.Internaldatastructureisinitializedininitmethod.

Advantages:

Itwillrunexactlyonce.Combinermayrunmultipletimesornotatall.
Wearesureitwillrunduringmapphase.Combinermayruneitheraftermapphaseorbeforereducephase.Thelattercase
providesnoreductionintransferreddata.
Inmappercombiningistypicallymoreeffective.Combinerdoesnotreduceamountofdataproducedbymappers,itonlygroups
generateddatatogether.Thatcausesunnecessaryobjectcreation,destruction,serializationanddeserialization.

Disadvantages:

Scalabilitybottleneck:thetechniquedependsonhavingenoughmemorytostoreallpartialresults.Wehavetoflushpartialresults
regularlytoavoidit.Combineruseproducenoscalabilitybottleneck.

3.2PairsandStripes

ExplainPairdesignpatteronacooccurenceexample.Includeadvantages/disadvantagesagainstStripesapproach,possible
optimizationsandtheirefficacy.

Theansweris:

Mappergenerateskeyscomposedfrompairsofwordsthatoccurredtogether.Thevaluecontainsthenumber1.Frameworkgroups
key/valuepairswiththesameworkpairstogetherandreducersimplycountsthenumbervaluesforeachincomingkey/valuepairs.

Eachfinalpairencodesacellincooccurrencematrix.Localaggregation,e.g.combinerorinmappercombining,canbeused.

Advantages:

Simplevalues,lessserialization/deserializationoverhead.
Simplermemorymanagement.Noscalabilitybottleneck(onlyifinmapperoptimizationwouldbeused).

Disadvantages:

Hugeamountofintermediatekey/valuepairs.Shuffleandsortphaseisslower.
Localaggregationislesseffectivetoomanydistinctkeys.
ExplainStripesdesignpatteronacooccurenceexample.Includeadvantages/disadvantagesagainstPairsapproach,
possibleoptimizationsandtheirefficacy.

Theansweris:

Mappergeneratesadistinctkeyfromeachencounteredword.Associatedvaluecontainsamapofallcooccurredwordsasmapkeys
andnumberofcooccurrencesasmapvalues.Frameworkgroupssamewordstogetherandreducermergesvaluemaps.

Eachfinalpairencodesarowincooccurrencematrix.Combinerorinmappercombiningcanbeused.

Advantages:

Smallamountofintermediatekey/valuepairs.Shuffleandsortphaseisfaster.
Intermediatekeysaresmaller.
Effectivelocalaggregationsmallernumberofdistinctkeys.

Disadvantages:

Complexvalues,moreserialization/deserializationoverhead.
Morecomplexmemorymanagement.Asvaluemapsmaygrowtoobig,theapproachhaspotentialforscalabilitybottleneck.

Explainscalabilitybottleneckcausedbystripesapproach.

Theansweris:

Stripessolutionkeepsamapofcooccurredwordsinmemory.Astheamountofcooccurredwordsisunlimited,themapsizeis
unlimitedtoo.Hugemapdoesnotfitintothememoryandcausespagingoroutofmemoryerrors.

3.3ComputingRelativeFrequencies

Relativefrequenciesofcooccurrencesproblem:

Input:textdocuments
key:documentid
value:textdocument

Output:key/valuepairswhere
key:pair(word1,word2)
value:#cooccurrences(word1,word2)/#cooccurrences(word1,anyword)

Fixfollowingsolutiontorelativefrequenciesofcooccurrencesproblem:

01 classMAPPER
02 methodINITIALIZE
03 H=newhashmap
04
05 methodMAP(docida,docd)
06 foralltermwindocddo
07 foralltermupatrineighbors(w)do
08 H(w)=H(w)+1
09 emit(pair(u,w),count1)
10
11 methodCLOSE
12 foralltermwinH
13 emit(pair(w,*),H(w))
14
15 classREDUCER
16 variabletotal_occurrences=0
17
18 methodREDUCE(pair(p,u),counts[c1,c2,...,cn])
19 s=0
20 forallcincounts[c1,c2,...,cn]do
21 s=s+c
22
23 ifu=*
24 total_occurrences=s
25 else
26 emit(pairp,s/total_occurrences)
27
28 classSORTING_COMPARATOR
29 methodcompare(key(p1,u1),key(p2,u2))
30 ifp1=p2ANDu1=*
31 returnkey1islower
32
33 ifp1=p2ANDu2=*
34 returnkey2islower
35
36 returncompare(p1,p2)

Theansweris:

Partitionerismissing,frameworkcouldsendkey/valuepairswithtotalstodifferentreducerthankey/pairswithwordpairs.

1 classPARTITIONING_COMPARATOR
2 methodcompare(key(p1,u1),key(p2,u2))
3 ifp1=p2
4 returnkeysareequal
5
6 returnkeysaredifferent
Describeorderinversiondesignpattern.

Theansweris:

Orderinversionisusedifthealgorithmrequirestwopassesthroughmappergeneratedkey/valuepairswiththesamekey.Thefirst
passgeneratessomeoverallstatisticwhichisthenappliedtodataduringthesecondpass.Thereducerwouldneedtobufferdatain
thememoryjusttobeabletopasstwicethroughthem.

Firstpassresultiscalculatedbymappersandstoredinsomeinternaldatastructure.Themapperemitstheresultinclosingmethod,
afterallusualintermediatekey/valuepairs.

Thepatternrequirescustompartitioningandsort.Firstpassresultmustcometothereducerbeforeusualkey/valuepairs.Ofcourse,it
mustcometothesamereducer.

3.4SecondarySorting

Describevaluetokeydesignpattern.

Theansweris:

Hadoopimplementationdoesnotprovidesortingforgroupedvaluesinreducersinput.Valuetokeyisusedasaworkaround.

Partofthevalueisaddedtothekey.Customsortthensortsprimarybythekeyandsecondarybytheaddedvalue.Custompartitioner
mustmovealldatawiththesameoriginalkeytothesamereducer.

3.5RelationalJoins

Describereducesidejoinbetweentableswithoneononerelationship.

Theansweris:

Mapperproduceskey/valuepairswithjoinidsaskeysandrowvaluesasvalue.Correspondingrowsfrombothtablesaregrouped
togetherbytheframeworkduringshuffleandsortphase.

Reducemethodinreducerobtainsjoinidandtwovalues,eachrepresentsrowfromonetable.Reducerjoinsthedata.

Describereducesidejoinbetweentableswithonetomanyrelationship.

Theansweris:

WeassumethatthejoinkeyisprimarykeyintablecalledS.SecondtableiscalledT.Inotherwords,thetableSinontheonesideof
therelationshipandthetableTisonthemanysideoftherelationship.

Wehavetoimplementmapper,customsorter,partitionerandreducer.

Mapperproduceskeycomposedfromjoinidandtableflag.Partitionersplitsthedatainsuchaway,thatallkey/valuepairswiththe
samejoinidgoestothesamereducer.Customsortputskey/valuepairgeneratedfromthetableSrightbeforekey/valuepairwiththe
samejoinidfromthetableT.

Reducersinputlookslikethis:
((JoinId1,s)>row)
((JoinId1,t)>[rows])
((JoinId2,s)>row)
((JoinId2,t)>[rows])
...
((JoinIdn,s),row)
((JoinIdn,t),[rows])

Thereducerjoinsallrowsfrom s pairwithallrowsfromfollowing t pair.

Describereducesidejoinbetweentableswithmanytomanyrelationship.

Theansweris:

WeassumethatdataarestoredintablescalledSandT.ThetableSissmaller.Wehavetoimplementmapper,customsorter,
partitionerandreducer.

Mapperproduceskeycomposedfromjoinidandtableflag.Partitionersplitsthedatainsuchaway,thatallkey/valuepairswiththe
samejoinidgoestothesamereducer.Customsortputsthekey/valuepairsgeneratedfromthetableSisrightbeforeallkey/valuepair
withthedatafromthetableT.

Reducersinputlookslikethis:
((JoinId1,s)>[rows])
((JoinId1,t)>[rows])
((JoinId2,s)>[rows])
((JoinId2,t)>[rows])
...
((JoinIdn,s),[rows])
((JoinIdn,t),[rows])

ThereducerbuffersallrowswiththesameJoinIdfromthetableSintothememoryandjoinsthemwithfollowingTtablerows.

Alldatafromthesmallertablemustfitintothememorythealgorithmhasscalabilitybottleneckproblem.

Describemapsidejoinbetweentwodatabasetables.

Theansweris:

Mapsidejoinworksonlyiffollowingassumptionshold:

bothdatasetsaresortedbythejoinkey,
bothdatasetsarepartitionedthesameway.

Mappermapsoverlargerdatasetandreadscorrespondingpartofsmallerdatasetinsidethemapper.Asthesmallersetispartitioned
thesamewayasbiggerone,onlyonemaptaskaccessthesamedata.Asthedataaresortedbythejoinkey,wecanperformmerge
join O(n) .

Describememorybackedjoin.

Theansweris:

Smallersetofdataisloadedintothememoryineverymapper.Mappersloopoverlargerdatasetandjoinsitwithdatainthememory.If
thesmallersetistoobigtofitintothememory,datasetisloadedintomemcachedorsomeothercachingsolution.

Whichoneisfaster?Mapsidejoinorreducesidejoin?

Theansweris:

Mapsidejoinisfaster.

GotoPart2

Reference:MapReduceQuestionsandAnswersfromourJCGpartnerMariaJurcovicovaattheThisisStuffblog.

DoyouwanttoknowhowtodevelopyourskillsettobecomeaJavaRockstar?

SubscribetoournewslettertostartRockingrightnow!
TogetyoustartedwegiveyouourbestsellingeBooksforFREE!

1.JPAMiniBook
2.JVMTroubleshootingGuide
3.JUnitTutorialforUnitTesting
4.JavaAnnotationsTutorial
5.JavaInterviewQuestions
6.SpringInterviewQuestions
7.AndroidUIDesign

andmanymore....

Emailaddress: Youremailaddress Signup!

LeaveaReply

Name(Required)

Mail(willnotbepublished)(Required)

Website
seven =5
Notifymeoffollowupcommentsviaemail.Youcanalsosubscribewithoutcommenting.
Signmeupforthenewsletter!

SubmitComment

KnowledgeBase Partners HallOfFame AboutJavaCodeGeeks

Academy Mkyong AndroidFullApplicationTutorialseries JCGs(JavaCodeGeeks)isanindependentonlinecommunityfo

Examples GWT2Spring3JPA2Hibernate3.5Tutorial creatingtheultimateJavatoJavadevelopersresourcecenterta


technicalarchitect,technicalteamlead(seniordeveloper),proje
TheCodeGeeksNetwork
Library AdvantagesandDisadvantagesofCloud juniordevelopersalike.JCGsservetheJava,SOA,AgileandTe
JavaCodeGeeks ComputingCloudcomputingprosandcons communitieswithdailynewswrittenbydomainexperts,articles,
Resources
AndroidGoogleMapsTutorial reviews,announcements,codesnippetsandopensourceprojec
.NETCodeGeeks
Tutorials
WebCodeGeeks AndroidLocationBasedServicesApplication
Whitepapers
GPSlocation

11OnlineLearningwebsitesthatyoushould
checkout

JavaBestPracticesVectorvsArrayListvs
HashSet

AndroidJSONParsingwithGsonTutorial

AndroidQuickPreferencesTutorial

DifferencebetweenComparatorand
ComparableinJava

JavaCodeGeeksandallcontentcopyright20102015,ExelixisMediaLtd|TermsofUse|PrivacyPolicy|Contact
AlltrademarksandregisteredtrademarksappearingonJavaCodeGeeksarethepropertyoftheirrespectiveowners.
JavaisatrademarkorregisteredtrademarkofOracleCorporationintheUnitedStatesandothercountries.
JavaCodeGeeksisnotconnectedtoOracleCorporationandisnotsponsoredbyOracleCorporation.

You might also like