Puppet: System Administration Automated

Ticket #1010 (closed defect: fixed)

Opened 7 months ago

Last modified 5 months ago

puppet/puppetmaster randomly corrupts file{} resources, seemingly after leaking RAM for some time

Reported by: Fujin
Assigned to: luke
Priority: highest
Milestone: 0.24.4
Component: library
Version: 0.24.2
Severity: critical
Keywords: puppet puppetmaster corrupt file{}
Cc:
Triage Stage: Needs design decision
Attached Patches: None
Complexity: Unknown

Description

Well, so there I was, sitting around, and suddenly all of the NRPE clients on my boxes died.

Managed to track it down to Puppet:

Jan 17 21:32:45 puppet puppetd[5768]: Caching catalog at /var/lib/puppet/state/localconfig.yaml
Jan 17 21:32:45 puppet puppetd[5768]: Starting catalog run
Jan 17 21:32:45 puppet crontab[6312]: (root) LIST (root)
Jan 17 21:32:46 puppet puppetd[5768]: (//Node[puppet]/puppetmaster/File[/etc/apt/sources.list]) Filebucketed to main with sum %23%23%23+WARNING+THIS+FILE+IS+CONTROLLED+BY+PUPPET%0A%23%23%23%23+ANY+CHA
NGES+MADE+NOT+VIA+PUPPET+WILL+BE+OVERWRITTEN%0A%0Adeb+http%3A%2F%2Fmaxrepo.maxnet.net.nz%2F+feisty+universe+%0A%0Adeb+http%3A%2F%2Fubuntu.maxnet.net.nz%2Fubuntu%2F+feisty+main+restricted+universe+mult
iverse%0Adeb+http%3A%2F%2Fubuntu.maxnet.net.nz%2Fubuntu%2F+feisty-updates+main+restricted+universe+multiverse%0Adeb+http%3A%2F%2Fubuntu.maxnet.net.nz%2Fubuntu%2F+feisty-backports+main+restricted+unive
rse+multiverse%0Adeb+http%3A%2F%2Fsecurity.maxnet.net.nz%2Fubuntu%2F+feisty-security+main+restricted+universe+multiverse%0A%0Adeb-src+http%3A%2F%2Fubuntu.maxnet.net.nz%2Fubuntu%2F+feisty+main+restrict
ed+universe+multiverse%0Adeb-src+http%3A%2F%2Fubuntu.maxnet.net.nz%2Fubuntu%2F+feisty-updates+main+restricted+universe+multiverse%0Adeb-src+http%3A%2F%2Fubuntu.maxnet.net.nz%2Fubuntu%2F+feisty-backpor
ts+main+restricted+universe+multiverse%0A
Jan 17 21:32:46 puppet puppetd[5768]: (//Node[puppet]/puppetmaster/File[/etc/apt/sources.list]/source) replacing from source puppet:///files/sources.list with contents {md5}bbf106a50a57848b97e7b82d5c0
3dfe3
Jan 17 21:32:46 puppet puppetd[5768]: (//Node[puppet]/puppetmaster/File[/etc/apt/sources.list]) Scheduling refresh of Exec[/usr/bin/apt-get update]
Jan 17 21:32:46 puppet puppetd[5768]: (//Node[puppet]/puppetmaster/File[/etc/apt/sources.list]) Scheduling refresh of Exec[/usr/bin/apt-get update]
Jan 17 21:32:47 puppet puppetd[5768]: (//Node[puppet]/puppetmaster/File[/etc/bash.bashrc]/source) No specified sources exist
Jan 17 21:32:47 puppet puppetd[5768]: (//Node[puppet]/monitoring/File[/etc/nagios/nrpe.cfg]) Filebucketed to main with sum %23%23%23+WARNING+THIS+FILE+IS+CONTROLLED+BY+PUPPET%0A%23%23%23+ANY+CHANGES+M
ADE+NOT+VIA+PUPPET+WILL+BE+OVERWRITTEN%0A%0A%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23
%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%0A%23+Sample+NRPE+Config+File+%0A%23+Written+by%3A+Ethan+Galstad+%28nagios%40nagios.org%29%0A%23+%0A%23+Last+Modified%3A+02-
23-2006%0A%23%0A%23+NOTES%3A%0A%23+This+is+a+sample+configuration+file+for+the+NRPE+daemon.++It+needs+to+be%0A%23+located+on+the+remote+host+that+is+running+the+NRPE+daemon%2C+not+the+host%0A%23+from+
which+the+check_nrpe+client+is+being+executed.%0A%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%
23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%23%0A%0A%0A%23+PID+FILE%0A%23+The+name+of+the+file+in+which+the+NRPE+daemon+should+write+it%27s+process+ID%0A%23+number.++
The+file+is+only+written+if+the+NRPE+daemon+is+started+by+the+root%0A%23+user+and+is+running+in+standalone+mode.%0A%0Apid_file%3D%2Fvar%2Frun%2Fnrpe.pid%0A%0A%0A%0A%23+PORT+NUMBER%0A%23+Port+number+we
+should+wait+for+connections+on.%0A%23+NOTE%3A+This+must+be+a+non-priviledged+port+%28i.e.+%3E+1024%29.%0A%23+NOTE%3A+This+option+is+ignored+if+NRPE+is+running+under+either+inetd+or+xinetd%0A%0Aserver
_port%3D5666%0A%0A%0A%0A%23+SERVER+ADDRESS%0A%23+Address+that+nrpe+should+bind+to+in+case+there+are+more+than+one+interface%0A%23+and+you+do+not+want+nrpe+to+bind+on+all+interfaces.%0A%23+NOTE%3A+This
+option+is+ignored+if+NRPE+is+running+under+either+inetd+or+xinetd%0A%0A%23server_address%3D192.168.1.1%0A%0A%0A%0A%23+NRPE+USER%0A%23+This+determines+the+effective+user+that+the+NRPE+daemon+should+ru
n+as.++%0A%23+You+can+either+supply+a+username+or+a+UID.%0A%23+%0A%23+NOTE%3A+This+option+is+ignored+if+NRPE+is+running+under+either+inetd+or+xinetd%0A%0Anrpe_user%3Dnagios%0A%0A%0A%0A%23+NRPE+GROUP%0
A%23+This+determines+the+effective+group+that+the+NRPE+daemon+should+run+as.++%0A%23+You+can+either+supply+a+group+name+or+a+GID.%0A%23+%0A%23+NOTE%3A+This+option+is+ignored+if+NRPE+is+running+under+e
ither+inetd+or+xinetd%0A%0Anrpe_group%3Dnagios%0A%0A%0A%0A%23+ALLOWED+HOST+ADDRESSES%0A%23+This+is+an+optional+comma-delimited+list+of+IP+address+or+hostnames+%0A%23+that+are+allowed+to+talk+to+the+NR
PE+daemon.%0A%23%0A%23+Note%3A+The+daemon+only+does+rudimentary+checking+of+the+client%27s+IP%0A%23+address.++I+would+highly+recommend+adding+entries+in+your+%2Fetc%2Fhosts.allow%0A%23+file+to+allow+o
nly+the+specified+host+to+connect+to+the+port%0A%23+you+are+running+this+daemon+on.%0A%23%0A%23+NOTE%3A+This+option+is+ignored+if+NRPE+is+running+under+either+inetd+or+xinetd%0A%0A%23allowed_hosts%3D1
27.0.0.1%2C192.168.0.2%0A+%0A%0A%0A%23+COMMAND+ARGUMENT+PROCESSING%0A%23+This+option+determines+whether+or+not+the+NRPE+daemon+will+allow+clients%0A%23+to+specify+arguments+to+commands+that+are+execut
ed.++This+option+only+works%0A%23+if+the+daemon+was+configured+with+the+--enable-command-args+configure+script%0A%23+option.++%0A%23%0A%23+%2A%2A%2A+ENABLING+THIS+OPTION+IS+A+SECURITY+RISK%21+%2A%2A%2
A+%0A%23+Read+the+SECURITY+file+for+information+on+some+of+the+security+implications%0A%23+of+enabling+this+variable.%0A%23%0A%23+Values%3A+0%3Ddo+not+allow+arguments%2C+1%3Dallow+command+arguments%0A
%0Adont_blame_nrpe%3D1%0A%0A%0A%0A%23+COMMAND+PREFIX%0A%23+This+option+allows+you+to+prefix+all+commands+with+a+user-defined+string.%0A%23+A+space+is+automatically+added+between+the+specified+prefix+s
tring+and+the%0A%23+command+line+from+the+command+definition.%0A%23%0A%23+%2A%2A%2A+THIS+EXAMPLE+MAY+POSE+A+POTENTIAL+SECURITY+RISK%2C+SO+USE+WITH+CAUTION%21+%2A%2A%2A%0A%23+Usage+scenario%3A+%0A%23+E
xecute+restricted+commmands+using+sudo.++For+this+to+work%2C+you+need+to+add%0A%23+the+nagios+user+to+your+%2Fetc%2Fsudoers.++An+example+entry+for+alllowing+%0A%23+execution+of+the+plugins+from+might+
be%3A%0A%23%0A%23+nagios++++++++++ALL%3D%28ALL%29+NOPASSWD%3A+%2Fusr%2Flib%2Fnagios%2Fplugins%2F%0A%23%0A%23+This+lets+the+nagios+user+run+all+commands+in+that+directory+%28and+only+them%29%0A%23+with
out+asking+for+a+password.++If+you+do+this%2C+make+sure+you+don%27t+give%0A%23+random+users+write+access+to+that+directory+or+its+contents%21%0A%0A%23+command_prefix%3D%2Fusr%2Fbin%2Fsudo+%0A%0A%0A%0A
%23+DEBUGGING+OPTION%0A%23+This+option+determines+whether+or+not+debugging+messages+are+logged+to+the%0A%23+syslog+facility.%0A%23+Values%3A+0%3Ddebugging+off%2C+1%3Ddebugging+on%0A%0Adebug%3D0%0A%0A%
0A%0A%23+COMMAND+TIMEOUT%0A%23+This+specifies+the+maximum+number+of+seconds+that+the+NRPE+daemon+will%0A%23+allow+plugins+to+finish+executing+before+killing+them+off.%0A%0Acommand_timeout%3D60%0A%0A%0
A%0A%23+WEEK+RANDOM+SEED+OPTION%0A%23+This+directive+allows+you+to+use+SSL+even+if+your+system+does+not+have%0A%23+a+%2Fdev%2Frandom+or+%2Fdev%2Furandom+%28on+purpose+or+because+the+necessary+patches%
0A%23+were+not+applied%29.+The+random+number+generator+will+be+seeded+from+a+file%0A%23+which+is+either+a+file+pointed+to+by+the+environment+valiable+%24RANDFILE%0A%23+or+%24HOME%2F.rnd.+If+neither+ex
ists%2C+the+pseudo+random+number+generator+will%0A%23+be+initialized+and+a+warning+will+be+issued.%0A%23+Values%3A+0%3Donly+seed+from+%2Fdev%2F%5Bu%5Drandom%2C+1%3Dalso+seed+from+weak+randomness%0A%0A
%23allow_weak_random_seed%3D1%0A%0A%0A%0A%23+INCLUDE+CONFIG+FILE%0A%23+This+directive+allows+you+to+include+definitions+from+an+external+config+file.%0A%0A%23include%3D%3Csomefile.cfg%3E%0A%0A%0A%0A%2
3+INCLUDE+CONFIG+DIRECTORY%0A%23+This+directive+allows+you+to+include+definitions+from+config+files+%28with+a%0A%23+.cfg+extension%29+in+one+or+more+directories+%28with+recursion%29.%0A%0A%23include_d
ir%3D%3Csomedirectory%3E%0A%23include_dir%3D%3Csomeotherdirectory%3E%0A%0A%0A%0A%23+COMMAND+DEFINITIONS%0A%23+Command+definitions+that+this+daemon+will+run.++Definitions%0A%23+are+in+the+following+for
mat%3A%0A%23%0A%23+command%5B%3Ccommand_name%3E%5D%3D%3Ccommand_line%3E%0A%23%0A%23+When+the+daemon+receives+a+request+to+return+the+results+of+%3Ccommand_name%3E%0A%23+it+will+execute+the+command+spe
cified+by+the+%3Ccommand_line%3E+argument.%0A%23%0A%23+Unlike+Nagios%2C+the+command+line+cannot+contain+macros+-+it+must+be%0A%23+typed+exactly+as+it+should+be+executed.%0A%23%0A%23+Note%3A+Any+plugin
s+that+are+used+in+the+command+lines+must+reside%0A%23+on+the+machine+that+this+daemon+is+running+on%21++The+examples+below%0A%23+assume+that+you+have+plugins+installed+in+a+%2Fusr%2Flocal%2Fnagios%2F
libexec%0A%23+directory.++Also+note+that+you+will+have+to+modify+the+definitions+below%0A%23+to+match+the+argument+format+the+plugins+expect.++Remember%2C+these+are%0A%23+examples+only%21%0A%0A%23+The
+following+examples+use+hardcoded+command+arguments...%0A%0A%23command%5Bcheck_users%5D%3D%2Fusr%2Flib%2Fnagios%2Fplugins%2Fcheck_users+-w+5+-c+10%0A%23command%5Bcheck_load%5D%3D%2Fusr%2Flib%2Fnagios%
2Fplugins%2Fcheck_load+-w+15%2C10%2C5+-c+30%2C25%2C20%0A%23command%5Bcheck_disk1%5D%3D%2Fusr%2Flib%2Fnagios%2Fplugins%2Fcheck_disk+-w+20+-c+10+-p+%2Fdev%2Fhda1%0A%23command%5Bcheck_disk2%5D%3D%2Fusr%2
Flib%2Fnagios%2Fplugins%2Fcheck_disk+-w+20+-c+10+-p+%2Fdev%2Fhdb1%0A%23command%5Bcheck_zombie_procs%5D%3D%2Fusr%2Flib%2Fnagios%2Fplugins%2Fcheck_procs+-w+5+-c+10+-s+Z%0A%23command%5Bcheck_total_procs%
5D%3D%2Fusr%2Flib%2Fnagios%2Fplugins%2Fcheck_procs+-w+150+-c+200+%0A%0A%23+The+following+examples+allow+user-supplied+arguments+and+can%0A%23+only+be+used+if+the+NRPE+daemon+was+compiled+with+support+
for+%0A%23+command+arguments+%2AAND%2A+the+dont_blame_nrpe+directive+in+this%0A%23+config+file+is+set+to+%271%27...%0A%0A%23command%5Bcheck_users%5D%3D%2Fusr%2Flib%2Fnagios%2Fplugins%2Fcheck_users+-w+
%24ARG1%24+-c+%24ARG2%24%0Acommand%5Bcheck_load%5D%3D%2Fusr%2Flib%2Fnagios%2Fplugins%2Fcheck_load+-w+%24ARG1%24+-c+%24ARG2%24%0Acommand%5Bcheck_disk%5D%3D%2Fusr%2Flib%2Fnagios%2Fplugins%2Fcheck_d
Jan 17 21:32:47 puppet isk+-w+%24ARG1%24+-c+%24ARG2%24+-p+%24ARG3%24%0Acommand%5Bcheck_procs%5D%3D%2Fusr%2Flib%2Fnagios%2Fplugins%2Fcheck_procs+-w+%24ARG1%24+-c+%24ARG2%24+-C+%24ARG3%24%0Acommand%5Bch
eck_mailq%5D%3D%2Fusr%2Flocal%2Fnagios%2Flibexec%2Fcheck_mailq+-w+%24ARG1%24+-c+%24ARG2%24+-M+%24ARG3%24%0A%0A%23%0A%23+local+configuration%3A%0A%23%09if+you%27d+prefer%2C+you+can+instead+place+direct
ives+here%0Ainclude%3D%2Fetc%2Fnagios%2Fnrpe_local.cfg%0A
Jan 17 21:32:47 puppet puppetd[5768]: (//Node[puppet]/monitoring/File[/etc/nagios/nrpe.cfg]/source) replacing from source puppet:///files/nrpe.cfg with contents {md5}1a87ce50fb42e453636246f3e276fddb
Jan 17 21:32:47 puppet puppetd[5768]: (//Node[puppet]/monitoring/File[/etc/nagios/nrpe.cfg]) Scheduling refresh of Service[nagios-nrpe-server]
Jan 17 21:32:47 puppet puppetd[5768]: (//Node[puppet]/Git::Web::Export[puppet]/Exec[/usr/bin/git-update-server-info #/srv/git/puppet/.git]/returns) executed successfully
Jan 17 21:32:48 puppet puppetd[5768]: (//Node[puppet]/monitoring/Service[nagios-nrpe-server]) Triggering 'refresh' from 1 dependencies
Jan 17 21:32:48 puppet nrpe[20102]: Caught SIGTERM - shutting down...
Jan 17 21:32:48 puppet nrpe[20102]: Cannot remove pidfile '/var/run/nrpe.pid' - check your privileges.
Jan 17 21:32:48 puppet nrpe[20102]: Daemon shutdown
Jan 17 21:32:48 puppet nrpe[6336]: Starting up daemon
Jan 17 21:32:48 puppet nrpe[6336]: No variable value specified in config file '/etc/nagios/nrpe.cfg' - Line 1
Jan 17 21:32:48 puppet nrpe[6336]: Config file '/etc/nagios/nrpe.cfg' contained errors, bailing out...
Jan 17 21:32:48 puppet puppetd[5768]: (//Node[puppet]/puppetmaster/File[/etc/resolv.conf]/source) No specified sources exist
Jan 17 21:32:50 puppet puppetd[5768]: (//Node[puppet]/puppetmaster/File[/etc/vim/vimrc]) Filebucketed to main with sum %22+%23%23%23+WARNING+THIS+FILE+IS+CONTROLLED+BY+PUPPET%0A%22+%23%23%23%23+ANY+CH
ANGES+MADE+NOT+VIA+PUPPET+WILL+BE+OVERWRITTEN%0A%22+All+system-wide+defaults+are+set+in+%24VIMRUNTIME%2Fdebian.vim+%28usually+just%0A%22+%2Fusr%2Fshare%2Fvim%2Fvimcurrent%2Fdebian.vim%29+and+sourced+b
y+the+call+to+%3Aruntime%0A%22+you+can+find+below.++If+you+wish+to+change+any+of+those+settings%2C+you+should%0A%22+do+it+in+this+file+%28%2Fetc%2Fvim%2Fvimrc%29%2C+since+debian.vim+will+be+overwritte
n%0A%22+everytime+an+upgrade+of+the+vim+packages+is+performed.++It+is+recommended+to%0A%22+make+changes+after+sourcing+debian.vim+since+it+alters+the+value+of+the%0A%22+%27compatible%27+option.%0A%0A%
22+This+line+should+not+be+removed+as+it+ensures+that+various+options+are%0A%22+properly+set+to+work+with+the+Vim-related+packages+available+in+Debian.%0Aruntime%21+debian.vim%0A%0A%22+Uncomment+the+n
ext+line+to+make+Vim+more+Vi-compatible%0A%22+NOTE%3A+debian.vim+sets+%27nocompatible%27.++Setting+%27compatible%27+changes+numerous%0A%22+options%2C+so+any+other+options+should+be+set+AFTER+setting+%
27compatible%27.%0A%22set+compatible%0A%0A%22+Vim5+and+later+versions+support+syntax+highlighting.+Uncommenting+the+next%0A%22+line+enables+syntax+highlighting+by+default.%0Asyntax+on%0A%0A%22+If+usin
g+a+dark+background+within+the+editing+area+and+syntax+highlighting%0A%22+turn+on+this+option+as+well%0Aset+nocompatible%0Aset+background%3Ddark%0Aset+expandtab%0Aset+tabstop%3D3%0Aset+softtabstop%3D3
%0Aset+shiftwidth%3D3%0A%22+Uncomment+the+following+to+have+Vim+jump+to+the+last+position+when%0A%22+reopening+a+file%0A%22if+has%28%22autocmd%22%29%0A%22++au+BufReadPost+%2A+if+line%28%22%27%5C%22%22
%29+%3E+0+%26%26+line%28%22%27%5C%22%22%29+%3C%3D+line%28%22%24%22%29%0A%22++++%5C%7C+exe+%22normal+g%27%5C%22%22+%7C+endif%0A%22endif%0A%0A%22+Uncomment+the+following+to+have+Vim+load+indentation+rul
es+according+to+the%0A%22+detected+filetype.+Per+default+Debian+Vim+only+load+filetype+specific%0A%22+plugins.%0A%22if+has%28%22autocmd%22%29%0A%22++filetype+indent+on%0A%22endif%0A%0A%22+The+followin
g+are+commented+out+as+they+cause+vim+to+behave+a+lot%0A%22+differently+from+regular+Vi.+They+are+highly+recommended+though.%0A%22set+showcmd%09%09%22+Show+%28partial%29+command+in+status+line.%0A%22s
et+showmatch%09%09%22+Show+matching+brackets.%0A%22set+ignorecase%09%09%22+Do+case+insensitive+matching%0A%22set+smartcase%09%09%22+Do+smart+case+matching%0Aset+incsearch%09%09%22+Incremental+search%0
A%22set+autowrite%09%09%22+Automatically+save+before+commands+like+%3Anext+and+%3Amake%0A%22set+hidden+++++++++++++%22+Hide+buffers+when+they+are+abandoned%0A%22set+mouse%3Da%09%09%22+Enable+mouse+usa
ge+%28all+modes%29+in+terminals%0A%0A%22+Source+a+global+configuration+file+if+available%0A%22+XXX+Deprecated%2C+please+move+your+changes+here+in+%2Fetc%2Fvim%2Fvimrc%0Aif+filereadable%28%22%2Fetc%2Fv
im%2Fvimrc.local%22%29%0A++source+%2Fetc%2Fvim%2Fvimrc.local%0Aendif%0A%0A
Jan 17 21:32:50 puppet puppetd[5768]: (//Node[puppet]/puppetmaster/File[/etc/vim/vimrc]/source) replacing from source puppet:///files/vimrc.conf with contents {md5}d2ba56055fc65fd4515d48b50d00bdec
Jan 17 21:32:51 puppet puppetd[5768]: (//Node[puppet]/puppetmaster/File[/etc/screenrc]/source) No specified sources exist
Jan 17 21:32:51 puppet crontab[6353]: (root) LIST (root)
Jan 17 21:32:51 puppet puppetd[5768]: (//File[/var/lib/puppet/modules]/source) {:type=>"directory\n/README", :owner=>"file", :mode=>"/"}
Jan 17 21:32:51 puppet puppetd[5768]: (//File[/var/lib/puppet/modules]/source) Cannot use files of type directory /README as sources
Jan 17 21:32:51 puppet puppetd[5768]: (//File[/var/lib/puppet/modules/README]) Filebucketed to main with sum 493  directory   0  0  {md5}Thu Jan 17 13:59:15 +1300 2008
Jan 17 21:32:51 puppet puppetd[5768]: (//File[/var/lib/puppet/modules/README]/ensure) removed
Jan 17 21:32:54 puppet crontab[6356]: (root) LIST (root)
Jan 17 21:32:54 puppet puppetd[5768]: (//Node[puppet]/Git::Web::Export[asterisk]/Exec[/usr/bin/git-update-server-info #/srv/git/asterisk/.git]/returns) executed successfully
Jan 17 21:32:55 puppet puppetd[5768]: (//Node[puppet]/puppetmaster/File[/etc/apt/trusted.gpg]/source) No specified sources exist
Jan 17 21:32:55 puppet puppetd[5768]: (/Exec[/usr/bin/apt-get update]) Triggering 'refresh' from 2 dependencies
Jan 17 21:32:55 puppet puppetd[5768]: (/Exec[/usr/bin/apt-get update]) Failed to call refresh on Exec[/usr/bin/apt-get update]: /usr/bin/apt-get update returned 100 instead of 0 at /etc/puppet/manifes
ts/classes/systems.pp:107
+MADE+NOT+VIA+PUPPET+WILL+BE+OVERWRITTEN%0A%23%0A%23%23+%7E%2F.bashrc%3A+executed+by+bash%281%29+for+non-login+shells.%0A%23+see+%2Fusr%2Fshare%2Fdoc%2Fbash%2Fexamples%2Fstartup-files+%28in+the+packag
e+bash-doc%29%0A%23+for+examples%0A%0A%23+If+not+running+interactively%2C+don%27t+do+anything%0A%5B+-z+%22%24PS1%22+%5D+%26%26+return%0A%0A%23+don%27t+put+duplicate+lines+in+the+history.+See+bash%281%
29+for+more+options%0Aexport+HISTCONTROL%3Dignoredups%0A%0A%23+check+the+window+size+after+each+command+and%2C+if+necessary%2C%0A%23+update+the+values+of+LINES+and+COLUMNS.%0Ashopt+-s+checkwinsize%0A%
0A%23+make+less+more+friendly+for+non-text+input+files%2C+see+lesspipe%281%29%0A%5B+-x+%2Fusr%2Fbin%2Flesspipe+%5D+%26%26+eval+%22%24%28lesspipe%29%22%0A%0A%23+set+variable+identifying+the+chroot+you+
work+in+%28used+in+the+prompt+below%29%0Aif+%5B+-z+%22%24debian_chroot%22+%5D+%26%26+%5B+-r+%2Fetc%2Fdebian_chroot+%5D%3B+then%0A++++debian_chroot%3D%24%28cat+%2Fetc%2Fdebian_chroot%29%0Afi%0A%0A%23+C
omment+in+the+above+and+uncomment+this+below+for+a+color+prompt%0APS1%3D%27%24%7Bdebian_chroot%3A%2B%28%24debian_chroot%29%7D%5C%5B%5C033%5B01%3B32m%5C%5D%5Cu%40%5Ch%5C%5B%5C033%5B00m%5C%5D%3A%5C%5B%5
C033%5B01%3B34m%5C%5D%5Cw%5C%5B%5C033%5B00m%5C%5D%5C%24+%27%0A%0A%23+If+this+is+an+xterm+set+the+title+to+user%40host%3Adir%0Acase+%22%24TERM%22+in%0Axterm%2A%7Crxvt%2A%29%0A++++PROMPT_COMMAND%3D%27ec
ho+-ne+%22%5C033%5D0%3B%24%7BUSER%7D%40%24%7BHOSTNAME%7D%3A+%24%7BPWD%2F%24HOME%2F%7E%7D%5C007%22%27%0A++++%3B%3B%0A%2A%29%0A++++%3B%3B%0Aesac%0A%0A%23+Alias+definitions.%0A%23+You+may+want+to+put+all
+your+additions+into+a+separate+file+like%0A%23+%7E%2F.bash_aliases%2C+instead+of+adding+them+here+directly.%0A%23+See+%2Fusr%2Fshare%2Fdoc%2Fbash-doc%2Fexamples+in+the+bash-doc+package.%0A%0A%23if+%5
B+-f+%7E%2F.bash_aliases+%5D%3B+then%0A%23++++.+%7E%2F.bash_aliases%0A%23fi%0A%0A%23+enable+color+support+of+ls+and+also+add+handy+aliases%0Aif+%5B+%22%24TERM%22+%21%3D+%22dumb%22+%5D%3B+then%0A++++ev
al+%22%60dircolors+-b%60%22%0A++++alias+ls%3D%27ls+--color%3Dauto%27%0A++++%23alias+dir%3D%27ls+--color%3Dauto+--format%3Dvertical%27%0A++++%23alias+vdir%3D%27ls+--color%3Dauto+--format%3Dlong%27%0Afi
%0A%0A%23+some+more+ls+aliases%0A%23alias+ll%3D%27ls+-l%27%0A%23alias+la%3D%27ls+-A%27%0A%23alias+l%3D%27ls+-CF%27%0A%0A%23+enable+programmable+completion+features+%28you+don%27t+need+to+enable%0A%23+
this%2C+if+it%27s+already+enabled+in+%2Fetc%2Fbash.bashrc+and+%2Fetc%2Fprofile%0A%23+sources+%2Fetc%2Fbash.bashrc%29.%0Aif+%5B+-f+%2Fetc%2Fbash_completion+%5D%3B+then%0A++++.+%2Fetc%2Fbash_completion%
0Afi%0Aexport+LESS%3D%22--tabs%3D3%22%0Aexport+VISUAL%3Dvi%0A%0Acase+%24TERM+in%0A++xterm%2A%29%0A++PROMPT_COMMAND%3D%27echo+-ne+%22%5C033%5D0%3B%24%7BUSER%7D%40%24%7BHOSTNAME%7D%3A+%24%7BPWD%7D%5C007
%22%27%0A++%3B%3B%0A++screen%2A%29%0A++echo+-n+-e+%22%5C033k%60hostname%60%5C033%5C134%22%0A++%3B%3B%0A++%2A%29%0A++%3B%3B%0Aesac%0A%0Afunction+ssh%28%29+%7B%0A++++++++case+%24TERM+in%0A++++++++++++li
nux%2A%29%0A++++++++++++++++echo+-n+-e+%22%5C033k%241%5C033%5C134%22%0A++++++++++++++++%2Fusr%2Fbin%2Fssh+%24%40%0A++++++++++++++++echo+-n+-e+%22%5C033k%60hostname%60%5C033%5C134%22%0A++++++++++++%3B%
3B%0A++++++++++++%2A%29%0A++++++++++++++++%2Fusr%2Fbin%2Fssh+%24%40%0A++++++++++++%3B%3B%0A++++++++esac%0A%7D%0A%0Afunction+telnet%28%29+%7B%0A++++++++case+%24TERM+in%0A++++++++++++linux%2A%29%0A+++++
+++++++++++echo+-n+-e+%22%5C033k%241%5C033%5C134%22%0A++++++++++++++++%2Fusr%2Fbin%2Ftelnet+%24%40%0A++++++++++++++++echo+-n+-e+%22%5C033k%60hostname%60%5C033%5C134%22%0A++++++++++++%3B%3B%0A+++++++++
+++%2A%29%0A++++++++++++++++%2Fusr%2Fbin%2Ftelnet+%24%40%0A++++++++++++%3B%3B%0A++++++++esac%0A%7D%0Aexport+http_proxy%3Dhttp%3A%2F%2Fproxy.maxnet.net.nz%3A8080%0A%2Fusr%2Fbin%2Fmotd%0A
Jan 17 21:32:56 puppet puppetd[5768]: (//Node[puppet]/puppetmaster/File[/etc/skel/.bashrc]/source) replacing from source puppet:///files/bashrc.conf with contents {md5}62350842fae499561653fa11bbece8f8
Jan 17 21:32:58 puppet puppetd[5768]: (//Node[puppet]/puppetmaster/File[/etc/ntpd.conf]) Filebucketed to main with sum 420 file  0  0  {md5}96a135c700fcc357f95fd9a6deb063c3
Jan 17 21:32:58 puppet puppetd[5768]: (//Node[puppet]/puppetmaster/File[/etc/ntpd.conf]/ensure) ensure changed 'file' to 'directory'
Jan 17 21:32:58 puppet puppetd[5768]: (//Node[puppet]/puppetmaster/File[/etc/puppet/puppet.conf]/source) No specified sources exist
Jan 17 21:32:58 puppet puppetd[5768]: (//Node[puppet]/puppetmaster/File[/etc/wgetrc]) Filebucketed to main with sum %23%23%23+WARNING+THIS+FILE+IS+CONTROLLED+BY+PUPPET%0A%23%23%23%23+ANY+CHANGES+MADE+
NOT+VIA+PUPPET+WILL+BE+OVERWRITTEN%0A%0A%23%23%23%0A%23%23%23+Sample+Wget+initialization+file+.wgetrc%0A%23%23%23%0A%0A%23%23+You+can+use+this+file+to+change+the+default+behaviour+of+wget+or+to%0A%23%
23+avoid+having+to+type+many+many+command-line+options.+This+file+does%0A%23%23+not+contain+a+comprehensive+list+of+commands+--+look+at+the+manual%0A%23%23+to+find+out+what+you+can+put+into+this+file.
%0A%23%23+%0A%23%23+Wget+initialization+file+can+reside+in+%2Fetc%2Fwgetrc%0A%23%23+%28global%2C+for+all+users%29+or+%24HOME%2F.wgetrc+%28for+a+single+user%29.%0A%23%23%0A%23%23+To+use+the+settings+in
+this+file%2C+you+will+have+to+uncomment+them%2C%0A%23%23+as+well+as+change+them%2C+in+most+cases%2C+as+the+values+on+the%0A%23%23+commented-out+lines+are+the+default+values+%28e.g.+%22off%22%29.%0A%0
A%0A%23%23%0A%23%23+Global+settings+%28useful+for+setting+up+in+%2Fetc%2Fwgetrc%29.%0A%23%23+Think+well+before+you+change+them%2C+since+they+may+reduce+wget%27s%0A%23%23+functionality%2C+and+make+it+b
ehave+contrary+to+the+documentation%3A%0A%23%23%0A%0A%23+You+can+set+retrieve+quota+for+beginners+by+specifying+a+value%0A%23+optionally+followed+by+%27K%27+%28kilobytes%29+or+%27M%27+%28megabytes%29.
++The%0A%23+default+quota+is+unlimited.%0A%23quota+%3D+inf%0A%0A%23+You+can+lower+%28or+raise%29+the+default+number+of+retries+when%0A%23+downloading+a+file+%28default+is+20%29.%0A%23tries+%3D+20%0A%0
A%23+Lowering+the+maximum+depth+of+the+recursive+retrieval+is+handy+to%0A%23+prevent+newbies+from+going+too+%22deep%22+when+they+unwittingly+start%0A%23+the+recursive+retrieval.++The+default+is+5.%0A%
23reclevel+%3D+5%0A%0A%23+By+default+Wget+uses+%22passive+FTP%22+transfer+where+the+client%0A%23+initiates+the+data+connection+to+the+server+rather+than+the+other%0A%23+way+around.++That+is+required+o
n+systems+behind+NAT+where+the+client%0A%23+computer+cannot+be+easily+reached+from+the+Internet.++However%2C+some%0A%23+firewalls+software+explicitly+supports+active+FTP+and+in+fact+has%0A%23+problems
+supporting+passive+transfer.++If+you+are+in+such%0A%23+environment%2C+use+%22passive_ftp+%3D+off%22+to+revert+to+active+FTP.%0A%23passive_ftp+%3D+off%0Apassive_ftp+%3D+on%0A%0A%23+The+%22wait%22+comm
and+below+makes+Wget+wait+between+every+connection.%0A%23+If%2C+instead%2C+you+want+Wget+to+wait+only+between+retries+of+failed%0A%23+downloads%2C+set+waitretry+to+maximum+number+of+seconds+to+wait+%2
8Wget%0A%23+will+use+%22linear+backoff%22%2C+waiting+1+second+after+the+first+failure%0A%23+on+a+file%2C+2+seconds+after+the+second+failure%2C+etc.+up+to+this+max%29.%0Awaitretry+%3D+10%0A%0A%0A%23%23
%0A%23%23+Local+settings+%28for+a+user+to+set+in+his+%24HOME%2F.wgetrc%29.++It+is%0A%23%23+%2Ahighly%2A+undesirable+to+put+these+settings+in+the+global+file%2C+since%0A%23%23+they+are+potentially+dang
erous+to+%22normal%22+users.%0A%23%23%0A%23%23+Even+when+setting+up+your+own+%7E%2F.wgetrc%2C+you+should+know+what+you%0A%23%23+are+doing+before+doing+so.%0A%23%23%0A%0A%23+Set+this+to+on+to+use+times
tamping+by+default%3A%0A%23timestamping+%3D+off%0A%0A%23+It+is+a+good+idea+to+make+Wget+send+your+email+address+in+a+%60From%3A%27%0A%23+header+with+your+request+%28so+that+server+administrators+can+c
ontact%0A%23+you+in+case+of+errors%29.++Wget+does+%2Anot%2A+send+%60From%3A%27+by+default.%0A%23header+%3D+From%3A+Your+Name+%3Cusername%40site.domain%3E%0A%0A%23+You+can+set+up+other+headers%2C+like+
Accept-Language.++Accept-Language%0A%23+is+%2Anot%2A+sent+by+default.%0A%23header+%3D+Accept-Language%3A+en%0A%0A%23+You+can+set+the+default+proxies+for+Wget+to+use+for+http+and+ftp.%0A%23+They+will+o
verride+the+value+in+the+environment.%0Ahttp_proxy+%3D+http%3A%2F%2F123.100.71.10%3A8080%2F%0Aftp_proxy+%3D+http%3A%2F%2F123.100.71.10%3A8080%2F%0A%0A%23+If+you+do+not+want+to+use+proxy+at+all%2C+set+
this+to+off.%0Ause_proxy+%3D+on%0A%0A%23+You+can+customize+the+retrieval+outlook.++Valid+options+are+default%2C%0A%23+binary%2C+mega+and+micro.%0A%23dot_style+%3D+default%0A%0A%23+Setting+this+to+off+
makes+Wget+not+download+%2Frobots.txt.++Be+sure+to%0A%23+know+%2Aexactly%2A+what+%2Frobots.txt+is+and+how+it+is+used+before+changing%0A%23+the+default%21%0A%23robots+%3D+on%0A%0A%23+It+can+be+useful+t
o+make+Wget+wait+between+connections.++Set+this+to%0A%23+the+number+of+seconds+you+want+Wget+to+wait.%0A%23wait+%3D+0%0A%0A%23+You+can+force+creating+directory+structure%2C+even+if+a+single+is+being%0
A%23+retrieved%2C+by+setting+this+to+on.%0A%23dirstruct+%3D+off%0A%0A%23+You+can+turn+on+recursive+retrieving+by+default+%28don%27t+do+this+if%0A%23+you+are+not+sure+you+know+what+it+means%29+by+setti
ng+this+to+on.%0A%23recursive+%3D+off%0A%0A%23+To+always+back+up+file+X+as+X.orig+before+converting+its+links+%28due%0A%23+to+-k+%2F+--convert-links+%2F+convert_links+%3D+on+having+been+specified%29%2
C%0A%23+set+this+variable+to+on%3A%0A%23backup_converted+%3D+off%0A%0A%23+To+have+Wget+follow+FTP+links+from+HTML+files+by+default%2C+set+this%0A%23+to+on%3A%0A%23follow_ftp+%3D+off%0A
Jan 17 21:32:58 puppet puppetd[5768]: (//Node[puppet]/puppetmaster/File[/etc/wgetrc]/source) replacing from source puppet:///files/wgetrc with contents {md5}6187b00ec33af456869ce2f32cdae560
Jan 17 21:32:59 puppet puppetd[5768]: Finished catalog run in 13.71 seconds

So, here's the skinny: puppetd, talking to puppetmasterd on localhost after puppetmasterd had been leaking memory for some hours, managed to corrupt a fair proportion of file{} resources. On other nodes, other files were corrupted, for example the /etc/pam.d files in my LDAP auth class. Pretty major. RAM usage of puppetmasterd was at 275MB, not as high as I've seen it before, but still high.

Sent a slightly surly, angry message to Lak about paying for it to get fixed - here's hoping I don't have to resort to that.

Change History

01/28/08 04:30:52 changed by luke

  • priority changed from normal to high.
  • severity changed from normal to major.
  • milestone set to 0.24.2.

I expect the only real solution here is to verify the md5 sum on the client side, to make sure the file being sent is the same file we asked for.

01/28/08 04:31:03 changed by luke

  • stage changed from Unreviewed to Accepted.

02/07/08 23:49:19 changed by Fujin

Found an IRC log containing the contents of one of the files:

[2008/01/11 22:55:04] <fujin> where it grows to ridiculous ram usage
[2008/01/11 22:55:13] <fujin> and starts overwriting file{} resources
[2008/01/11 22:55:28] <fujin> < 420 file 0 0 {md5}55e72f92f1075842e36dfe449fb9edae
[2008/01/11 22:55:30] <fujin> with stuff like that

02/16/08 23:05:50 changed by rra

We just saw the same thing. In addition to replacing some files with those hash values, it also replaced some files with empty directories (/etc/crontab on one system, /etc/resolv.conf on another system -- that caused interesting problems).
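
For anyone checking nodes for damage, here is a minimal Ruby sketch of a detector for the corrupted contents quoted in this ticket. It assumes, based only on the samples above ("420 file 0 0 {md5}..." and "493  directory   0  0  {md5}..."), that the bad content is a single line of the form mode, type, owner, group, checksum, with the mode printed in decimal (420 decimal is 0644 octal, 493 is 0755). The field meanings and the names looks_like_metadata?, octal_mode and suspect_file? are assumptions made for illustration, not part of Puppet.

```ruby
# Pattern inferred from the corrupted samples in this ticket (an assumption):
# <decimal mode> <type> <owner> <group> {md5}<anything>
METADATA_LINE = /\A\d+\s+(?:file|directory|link)\s+\d+\s+\d+\s+\{md5\}.*\z/

# True if the whole content is one line that looks like file metadata
# instead of real file content.
def looks_like_metadata?(content)
  stripped = content.strip
  return false if stripped.empty? || stripped.include?("\n")
  !!(stripped =~ METADATA_LINE)
end

# Render a decimal mode field such as "420" as octal, e.g. "0644".
def octal_mode(field)
  format("%04o", field.to_i)
end

# True if `path` is a small regular file whose content matches the pattern.
def suspect_file?(path)
  return false unless File.file?(path) && File.size(path) < 512
  looks_like_metadata?(File.read(path))
end
```

Something like suspect_file?("/etc/nagios/nrpe.cfg") could be run over the managed paths on a node. It will not catch files that were replaced by empty directories; those need a separate File.directory? check against what the manifest expects.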

(follow-up: ↓ 6 ) 02/21/08 11:31:19 changed by tim

We luckily killed puppetmaster before it could corrupt files, because we get munin errors before it goes really bad. At this time, we have resorted to not running puppetmaster permanently, and only running it and puppetd when changes are made. I have no idea how to go about debugging this... Any tips?

(in reply to: ↑ 5 ) 02/21/08 11:47:02 changed by immerda

We were having the same problem as well. An authorized_keys file was filled with this sort of md5-hash stuff.

02/22/08 05:20:38 changed by luke

  • status changed from new to closed.
  • resolution set to fixed.

I haven't been able to find the source of these issues, but I'm hoping the fix in [b06767ee2d7c22c27d746d3e8d1b6effa37deaa6] will be close enough.

This commit adds a client-side checksum verification, so that the content written to disk is the same as the content on the server, which should at least get you failures instead of bad data written to disk.
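
The idea of a client-side check can be sketched in a few lines of Ruby. This is only an illustration of the technique, not the code from the commit: the method name verify_content! and the ChecksumMismatch class are hypothetical, and the "{md5}<hex>" string format is taken from the log lines in the description.

```ruby
require "digest/md5"

# Raised when the received content does not match the checksum we asked for.
class ChecksumMismatch < StandardError; end

# `expected` is a string in the "{md5}<hex>" form seen in the logs.
# Returns the content unchanged if it matches, raises otherwise.
def verify_content!(content, expected)
  algo, hex = expected.match(/\A\{(\w+)\}(\h+)\z/)&.captures
  raise ArgumentError, "unsupported checksum: #{expected.inspect}" unless algo == "md5"

  actual = Digest::MD5.hexdigest(content)
  raise ChecksumMismatch, "expected #{hex}, got #{actual}" unless actual == hex

  content
end
```

With a check like this in front of the write, a response such as "420 file 0 0 {md5}..." arriving in place of the real file would raise instead of being written to disk. It does nothing for a resource whose type itself has been confused (file versus directory), since that case never reaches a content comparison.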

03/14/08 01:01:13 changed by adamhjk

  • priority changed from high to highest.
  • version changed from 0.24.1 to 0.24.2.
  • severity changed from major to critical.
  • milestone changed from 0.24.2 to 0.24.3.

See #1131. There is at least one report of similar conditions, where a file resource gets corrupted into creating a directory, and that corruption appears to be slipping past the checksum fix. I have trouble imagining a more critical bug in puppet.

03/14/08 01:27:37 changed by luke

  • status changed from closed to reopened.
  • resolution deleted.

Is anyone in a position to help debug the problem? Can anyone reproduce it and give me access to the system? Can anyone give me traces from the server and/or client when there are problems?

03/14/08 05:00:45 changed by adamhjk

I can provide an EC2 node running a copy of the manifests that we know trigger the bug, if it would be helpful.

03/14/08 08:15:24 changed by Fujin

I can provide an Ubuntu 7.04 system w/ 0.24.1-2 from Debian which exhibits this behaviour (when puppetmaster isn't restarted regularly) if an NDA is signed.

03/16/08 22:24:50 changed by luke

  • milestone changed from 0.24.3 to elmo.

This problem is clearly triggered by something specific on the server, and it involves significant confusion on the server. Until I can trigger that confusion on demand, I can't reproduce the problem.

I've already spent a couple of days doing what I could to shore up the situation, and I've decided that it's unreasonable for me to spend my time on someone else's servers trying further to reproduce the problem. Based on the days I've already spent, I can't imagine that I'd fix the problem in less than 3-5 days, especially since it would require me learning the local environment.

I've provided all the resources you need in #1131 to reproduce the problem on a test system. If you're a customer, we already have a relationship and I clearly am expected to help you. If you're not a customer, you've decided that you're comfortable with community support and my own best-effort; well, I've already put out my best effort in this case, and it's time for someone else to provide some help.

03/18/08 11:15:33 changed by adamhjk

Okay!

There is some more data needed for tracking down this bug. We have thought for a while now that 1010 and #1131 are related, since we tend to see the corruption happen as memory consumption increases. Several times today, while spending a lot of time on Fujin's magically repeatable system, we saw a few things that seem to contradict this hypothesis:

1. We saw 1010 events happen when memory consumption was relatively low, and nowhere near the large scale spikes we sometimes see.

2. We saw objects in memory that seemed to be displaying quite a bit of growth, but they weren't obviously related to the file objects at all.

3. The objects most reported to be leaking had nothing to do with file resources.

At this point, having spent a lot of time poking around inside a running puppet master exhibiting 1010 and #1131, my working hypothesis is that 1010 and #1131 are *not* related.

What we need to do now is prove it, and anyone with a 1010 system can help.

We need you to figure out which resource is affected by 1010. The best way to get this is to grab the debug/trace output that shows the bad comparison. We'll be happy with just a File[/i/got/hit/by/1010] style resource designation.

Next, we need you to go over to #1131, and follow the instructions to get gdb running against your puppetmaster.

Once you have it running, and you've redirected_stdout, we need you to run:

(gdb) eval "total = ObjectSpace.each_object {|x| if x.class.to_s =~ /Puppet/; puts '---'; puts x.inspect; end }; puts \"---\nTotal Objects: #{total}\""

This is going to take a very long time to complete, and generate a really, really large amount of data. What we're looking for is the proverbial needle in the haystack -- we want to find the corrupted file resource.
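If the full dump is too much to wade through, a per-class tally may be a useful first pass. This is only a sketch of the same ObjectSpace walk as the eval above, written as plain Ruby; the helper name tally_live_objects is made up for illustration, and it counts objects rather than printing each one.

```ruby
# Hypothetical helper (not part of Puppet): count live objects per class
# for every class whose name matches the given pattern.
def tally_live_objects(pattern)
  counts = Hash.new(0)
  # Restricting to Object skips BasicObject instances, which have no #class.
  ObjectSpace.each_object(Object) do |obj|
    name = obj.class.to_s
    counts[name] += 1 if name =~ pattern
  end
  counts
end

# Print the ten most numerous matching classes, largest first.
tally_live_objects(/Puppet/).sort_by { |_, n| -n }.first(10).each do |name, n|
  puts "#{n}\t#{name}"
end
```

Run twice a few minutes apart, the difference between the two tallies should hint at which classes are growing.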

If we can't find it, that means it's not being persisted within a single puppetmasterd process. Which raises the question: if it's not being persisted, and it's related to the leak, why do we garbage collect it *after it's corrupted by the leak*?

This bug is starting to smell more and more like thread safety to me, but I would like to get the leak relationship proven or disproven first.

Once you have that dump, send it to adam at hjksolutions.com. It'll be in /tmp/ruby-debug.PID, where PID is the PID of your running puppet master.
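If you want to look through the dump yourself first, something like the following might help. It is a rough sketch that assumes the dump has the layout the eval above produces (a '---' line before each inspected object); the helper name records_mentioning and the sample resource paths are made up for illustration.

```ruby
require 'stringio'

# Hypothetical helper: return every '---'-separated record in the dump
# whose text mentions the given string (for example a file path).
def records_mentioning(io, needle)
  hits = []
  record = []
  flush = lambda do
    text = record.join
    hits << text if text.include?(needle)
    record.clear
  end
  io.each_line do |line|
    if line.strip == '---'
      flush.call          # a separator closes the previous record
    else
      record << line
    end
  end
  flush.call              # the last record has no trailing separator
  hits
end

# Tiny in-memory sample standing in for a real dump file.
sample = StringIO.new("---\n#<Example path=/etc/foo>\n---\n#<Example path=/etc/bar>\n")
records_mentioning(sample, "/etc/bar").each { |rec| puts rec }
```

Against a real dump you would pass an open File for /tmp/ruby-debug.PID instead of the StringIO, and the path of the resource you saw get hit by 1010 as the second argument.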

Adam

(follow-up: ↓ 15 ) 03/20/08 09:44:57 changed by adamhjk

  • owner changed from community to luke.
  • status changed from reopened to new.
  • stage changed from Accepted to Needs design decision.

After much debugging and hypothesizing, lak, fujin, shadoi and I have figured this out.

Shadoi had an incident today where an 0.23.2 client was accidentally upgraded to 0.24.1. This client then 1010'ed, but the 0.23.2 puppet master showed no signs of either memory leaking or of spreading the 1010 activity beyond this single node.

That started a conversation with Luke around where in the client code things could be at fault. He pointed out the http pooling code in network/http_pool.rb, and suggested we eliminate the caching altogether. I put together a patch, Fujin built and distributed it to his clients, and his otherwise very reliable 1010-causing infrastructure has not seen a recurrence since. (With load essentially identical to the conditions under which we could regularly trigger this bug.)

The patch is at:

git://junglist.gen.nz/puppet.upstream bug1010

Now, a few caveats:

1. This causes lots of "Other end went away" errors on the client. This appears to be lines 68-75 of network/xmlrpc/client.rb. I'm pretty sure that it can just be silenced. The best thing to do is explicitly tear down the HTTP connection, but I'm not sure where that should be done. I imagine lak is the man to carry this over the finish line in that regard.

2. The memory leaking being tracked in #1131 is greatly exacerbated by this issue. We saw much faster growth than we had previously, and when we finally killed the puppetmaster it was at nearly 1.5GB of VSS. If you are being plagued by #1010, and you don't want to wait for some fixes to #1131, make sure you stick a cronjob in there to restart your mongrels before they eat you alive.
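On caveat 1, explicit teardown could look roughly like the sketch below. It assumes the client ends up talking through Ruby's Net::HTTP; the helper name with_fresh_connection is made up, and this is not the code in the bug1010 branch, just the general shape of "open, use, always close".

```ruby
require 'net/http'

# Hypothetical helper: open a brand-new connection, hand it to the block,
# and always close it afterwards, even if the block raises.
def with_fresh_connection(host, port)
  http = Net::HTTP.new(host, port, nil) # nil proxy address: connect directly
  http.start
  yield http
ensure
  # Only finish a connection that was actually opened.
  http.finish if http && http.started?
end
```

Nothing is cached here, so each call pays the full connection setup cost; that is the trade-off for never handing a stale or shared connection to another caller.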

Fujin should get special thanks here -- if it weren't for him, and his willingness to allow us to trample all over his puppetmaster, this bug would have lingered even longer.

Shadoi gets the "good eye" award. Thanks for pointing us in the right direction.

As always, Luke's guidance was the key to putting a fix in place.

See you at the next 0.24.x release with a no longer occurring #1010 once we fix the chattering.

(in reply to: ↑ 14 ; follow-up: ↓ 16 ) 03/20/08 11:38:33 changed by bart

Replying to adamhjk:

After much debugging and hypothesizing, lak, fujin, shadoi and I have figured this out.

Great work!

That started a conversation with Luke around where in the client code things could be at fault. He pointed out the http pooling code in network/http_pool.rb, and suggested we eliminate the caching altogether.

Is everyone who is seeing this problem using Mongrel? We're using Webrick and haven't seen any file corruption. We are seeing the memory leak from #1131 though.

(in reply to: ↑ 15 ) 03/20/08 13:43:12 changed by immerda

Is everyone who is seeing this problem using Mongrel? We're using Webrick and haven't seen any file corruption. We are seeing the memory leak from #1131 though.

We are seeing issues with #1010 (#1131 not yet observed) and are using Webrick.

03/21/08 05:34:21 changed by luke

  • milestone changed from elmo to 0.24.4.

At this point, my plan is to provide an option to enable http pooling, leaving it off by default. This will result in slowness, but we all would obviously rather be slow than have corrupt files.

03/21/08 18:17:52 changed by adamhjk

Sounds good to me.

03/24/08 15:58:06 changed by luke

  • status changed from new to closed.
  • resolution set to fixed.

I've disabled keep-alive in [273c7ec]. Note that this will result in a speed hit, but that's better than file corruption.

If you want to re-enable it, just edit the network/http_pool.rb file. We decided not to make it a run-time option, given its danger.

I'd still love to know why this corruption is happening, such that at a later date we could re-enable it.
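For anyone wondering what "keep-alive off by default" amounts to, here is a minimal sketch. It is not the contents of network/http_pool.rb or of [273c7ec]; the module name SketchHttpPool, the KEEP_ALIVE constant and the reuse parameter are all made up for illustration.

```ruby
require 'net/http'

# Hypothetical stand-in for a connection pool with caching switched off.
module SketchHttpPool
  KEEP_ALIVE = false   # flip to true to reuse connections again
  @instances = {}

  def self.keep_alive?
    KEEP_ALIVE
  end

  # Returns a Net::HTTP object for host:port. With reuse off, every call
  # builds a new object; with reuse on, one object per host:port is cached.
  def self.http_instance(host, port, reuse = keep_alive?)
    return Net::HTTP.new(host, port) unless reuse
    @instances["#{host}:#{port}"] ||= Net::HTTP.new(host, port)
  end
end
```

With reuse off, no connection object is ever shared between callers, which costs a new connection per request. That is the speed hit mentioned above.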