Hotboot problems
Post is unread #1 Apr 26, 2008, 11:07 am

Banner
Magician
GroupMembers
Posts169
JoinedNov 29, 2005

I'm still having the problem with random crashes I mentioned in my other post. The MUD crashes with this error randomly when a player leaves or is disconnected:

Wed Apr 23 16:44:16 2008 :: Saving roche_asteroids.are...
Wed Apr 23 16:44:16 2008 :: Saving Ryloth.are...
Wed Apr 23 16:44:16 2008 :: Saving pships.are...
Wed Apr 23 16:45:24 2008 :: Log Luorei: bank balance
Wed Apr 23 16:45:29 2008 :: Log Luorei: bank deposit 60000
Wed Apr 23 16:45:57 2008 :: Yurik has quit.
accept_new: select: poll: Bad file descriptor


It worked before, and all I did was add in hotboot. I think it has something to do with the 'control' value in comm.c, as that's the only call accept_new makes on any variable. I saw in a modified SWR that control was closed like this:

   log_string( "Booting Database" );
   boot_db( fCopyOver );
   log_string( "Initializing socket" );
   if( !fCopyOver )  /* We have already the port if copyover'ed */
   {
      control = init_socket( port );
 //     control2 = init_socket( port + 1 );
   }
   sprintf( log_buf, "&wStar Wars: Galactic Insights v%s.%s is ready to rock and roll on port %d.&W", MUD_VERSION_MAJOR,
            MUD_VERSION_MINOR, port );
   log_string( log_buf );
   if( fCopyOver )
   {
      log_string( "Initiating hotboot recovery." );
      hotboot_recover(  );
   }
   game_loop(  ); 
   close( control );   /* <---- THIS LINE */

#ifdef IMC
   imc_shutdown( FALSE );
#endif
#ifdef I3
   I3_shutdown( 0 );
#endif


It still crashes after I added that fix, but certainly not as often.

My other problem is that hotboot is saving items inside of corpses. Someone looted a corpse, a hotboot was done, and the items appeared back inside the corpse and on the player. I added the fixes to fread_obj so that it shouldn't save corpses, but it apparently still is. Any help?

       
Post is unread #2 Apr 26, 2008, 1:46 pm

David Haley
Sorcerer
GroupMembers
Posts903
JoinedJan 29, 2007

You should use gdb to figure out where it's crashing. If you're not getting core dumps, you should figure out how to turn them on. See e.g. Nick Gammon's gdb guide for how to enable large core dumps.

with the 'control' value in comm.c, as that's the only call accept_new makes on any variable

What does this mean, more precisely?

Well, anyhow, debugging this without gdb is kind of like looking for a needle in a haystack... I've already said what is happening: a descriptor is being closed that shouldn't be, and is being added to the list of descriptors to check at network update. Why or where that happens will be very hard to tell without gdb...

You should compare your implementation of the hotboot snippet with the actual snippet to make sure that you did everything exactly as you were supposed to. Try doing it with a stock version of your codebase and see if you can get it to work there.
       
Post is unread #3 Apr 27, 2008, 9:04 am

Banner
Magician
GroupMembers
Posts169
JoinedNov 29, 2005

I got it to crash in gdb, but there was no stack.

Sun Apr 27 10:03:41 2008 :: Loading Hall of Fame
Sun Apr 27 10:03:41 2008 :: Resetting variables
Sun Apr 27 10:03:41 2008 :: Initializing socket
Sun Apr 27 10:03:41 2008 :: &wStar Wars: Galactic Insights v2.3 is ready to rock and roll on port 8062.&W
Sun Apr 27 10:03:41 2008 :: Initiating hotboot recovery.
Sun Apr 27 10:03:41 2008 :: Loading player data for: Banner (21K)
Sun Apr 27 10:03:41 2008 :: Updating area entry for Banner
Sun Apr 27 10:03:41 2008 :: Loading player data for: Newtest (1K)
Sun Apr 27 10:03:41 2008 :: Loading player data for: Oldtest (1K)
Sun Apr 27 10:03:41 2008 :: Hotboot recovery complete.
Sun Apr 27 10:03:41 2008 :: Registering SIGSEGV handler
Sun Apr 27 10:03:42 2008 :: Updating Webserver Information...

accept_new: select: poll: Bad file descriptor

Program exited with code 01.
(gdb) frame
No stack.
(gdb) bt
No stack.
(gdb) 
       
Post is unread #4 Apr 27, 2008, 12:56 pm

David Haley
Sorcerer
GroupMembers
Posts903
JoinedJan 29, 2007

Set a breakpoint at the line where it exits, wait for it to crash, and then examine the stack. You should be able to find it by searching for the string "poll" in comm.c -- you're looking for a call to perror, I believe.
       
Post is unread #5 Apr 27, 2008, 4:48 pm

Quixadhal
Conjurer
GroupMembers
Posts398
JoinedMar 8, 2005

Banner said:

   log_string( "Booting Database" );
   boot_db( fCopyOver );
   log_string( "Initializing socket" );
   if( !fCopyOver )  /* We have already the port if copyover'ed */
   {
      control = init_socket( port );
 //     control2 = init_socket( port + 1 );
   }
   sprintf( log_buf, "&wStar Wars: Galactic Insights v%s.%s is ready to rock and roll on port %d.&W", MUD_VERSION_MAJOR,
            MUD_VERSION_MINOR, port );
   log_string( log_buf );
   if( fCopyOver )
   {
      log_string( "Initiating hotboot recovery." );
      hotboot_recover(  );
   }
   game_loop(  ); 
   close( control );   /* <---- THIS LINE */

#ifdef IMC
   imc_shutdown( FALSE );
#endif
#ifdef I3
   I3_shutdown( 0 );
#endif


It still crashes after I added that fix, but certainly not as often.


control is assigned about 13 lines up. It's probably the listening socket for the mud, which will be checked using either select() or poll() for incoming connections.

If that socket has already been closed, and you try to close it again, the universe will be unhappy. Likewise, if you try to close it and it hasn't yet been opened, life will be bad. C is annoying that way; the paranoid among us tend to check almost everything possible before accessing a pointer.
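
One way to make that kind of paranoia cheap is a small wrapper that closes a descriptor at most once and then marks it as closed. This is only a sketch: `close_once` is a hypothetical helper, not something in SWR FUSS, and it assumes unopened descriptors are initialised to -1. On POSIX systems a second close() on a stale number fails with EBADF; the more serious risk is that the number has been reused for a new connection in the meantime.

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical helper (not part of SWR FUSS): close a descriptor at most once.
 * Returns 0 if the descriptor was closed now or was already marked closed,
 * and -1 if close() itself reported an error. */
int close_once( int *fd )
{
   assert( fd != NULL );

   if( *fd < 0 )        /* never opened, or already closed */
      return 0;

   int rc = close( *fd );
   *fd = -1;            /* mark as closed so later checks can see it */
   return rc;
}
```

With something like this, the shutdown path could call `close_once( &control )` and the game loop could skip any descriptor that is negative before adding it to an fd_set.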

Banner said:

My other problem is that hotboot is saving items inside of corpses. Someone looted a corpse, a hotboot was done, and the items appeared back inside the corpse and on the player. I added the fixes to fread_obj so that it shouldn't save corpses, but it apparently still is. Any help?


That's a classic exploit which happens when you don't have atomic code. You go to move an object from the corpse to your inventory. The code copies the object to you, then deletes it from the corpse. If it crashes between the two operations (and the resulting hotboot code saves both inventories and reboots), you end up with two copies.

Some of your choices are:
1. Live with item duplication bugs and try to prevent things that cause crashes.
2. Reverse the order so it deletes first and then copies... that would result in item loss, rather than duplication.
3. Devise a persistent scheme for doing item transfers. If you used a database, you'd wrap both operations with a transaction, which would roll back on a crash. Not using a database, you could open a file that describes the transaction in progress so if you crash before it completes, the recovery code on the other end of the hotboot could look for such files and pick up where it left off.
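
A minimal sketch of option 3, assuming each object can be identified by a numeric id and that a single journal file is enough. The function names, the file format, and the idea of a numeric object id are all hypothetical; nothing here is existing SWR code.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical: record the intended transfer before touching either inventory.
 * Writes one line "<obj_id> <from> <to>" (names must not contain spaces).
 * Returns 0 on success, -1 on failure. */
int journal_begin( const char *path, long obj_id, const char *from, const char *to )
{
   assert( path != NULL && from != NULL && to != NULL );

   FILE *fp = fopen( path, "w" );
   if( !fp )
      return -1;

   int ok = fprintf( fp, "%ld %s %s\n", obj_id, from, to ) > 0;
   if( fclose( fp ) != 0 )   /* fclose flushes the buffer out of the process */
      ok = 0;
   return ok ? 0 : -1;
}

/* Hypothetical: the transfer finished, so the journal is no longer needed. */
int journal_commit( const char *path )
{
   assert( path != NULL );
   return remove( path );
}

/* Hypothetical: called by the recovery code after a reboot.
 * Returns 1 and fills *obj_id if a transfer was left unfinished, else 0. */
int journal_pending( const char *path, long *obj_id )
{
   assert( path != NULL && obj_id != NULL );

   FILE *fp = fopen( path, "r" );
   if( !fp )
      return 0;

   int found = fscanf( fp, "%ld", obj_id ) == 1;
   fclose( fp );
   return found;
}
```

The move itself would sit between `journal_begin` and `journal_commit`. If the process dies in between, the recovery side finds the journal and can check both inventories for that object id.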

That's one of the reasons I prefer to let the driver crash. If you engineer it so things bounce themselves, then you don't have as much incentive to actually FIX the bugs. If your players have been sending you email for 6 hours because their game is sitting in gdb, frozen on a seg-fault, I (at least) am more likely to sit down and fix the cause of the problem, rather than slapping a band-aid on and letting it run again. :)
       
Post is unread #6 Apr 27, 2008, 6:11 pm   Last edited Apr 27, 2008, 8:07 pm by Banner

Banner
Magician
GroupMembers
Posts169
JoinedNov 29, 2005

Quixadhal said:

That's a classic exploit which happens when you don't have atomic code. You go to move an object from the corpse to your inventory. The code copies the object to you, then deletes it from the corpse. If it crashes between the two operations (and the resulting hotboot code saves both inventories and reboots), you end up with two copies.


No no, someone looted the corpse, I did a hotboot, and the items respawned in both places. I didn't say anything crashed or an emergency hotboot happened.

Quixadhal said:

control is assigned about 13 lines up. It's probably the listening socket for the mud, which will be checked using either select() or poll() for incoming connections.

If that socket has already been closed, and you try to close it again, the universe will be unhappy. Likewise, if you try to close it and it hasn't yet been opened, life will be bad. C is annoying that way, the paranoid among us tend to check almost everything possible before accessing a pointer.


So what exactly are you saying I should do?
       
Post is unread #7 Apr 28, 2008, 2:21 am

Samson
Black Hand
GroupAdministrators
Posts3,639
JoinedJan 1, 2002

Heh. I'd tell you my solution but you explicitly said you didn't want to hear it already :)

Welcome to the hell of running dangerous code.
       
Post is unread #8 Apr 28, 2008, 9:11 am

David Haley
Sorcerer
GroupMembers
Posts903
JoinedJan 29, 2007

Banner said:

So what exactly are you saying I should do?

a little bird who flew by said:

Set a breakpoint at the line where it exits, wait for it to crash, and then examine the stack.
       
Post is unread #9 Apr 28, 2008, 11:58 am

Banner
Magician
GroupMembers
Posts169
JoinedNov 29, 2005


Samson said:

Heh. I'd tell you my solution but you explicitly said you didn't want to hear it already :)

Welcome to the hell of running dangerous code.


What exactly does that even mean? I never said anywhere I didn't want to hear anything. And if your comment is referring to emergency copyover, that's not the case. It works fine since I fixed it for hotboot, and it worked fine on copyover. The error I'm receiving has nothing to do with it, and if it does, then tell me so I can fix it instead of giving me cryptic messages. I don't see how switching to hotboot suddenly creates a new descriptor error that involves emergency copyover. :)
       
Post is unread #10 Apr 28, 2008, 12:01 pm

Banner
Magician
GroupMembers
Posts169
JoinedNov 29, 2005


DavidHaley said:

Banner said:

So what exactly are you saying I should do?

a little bird who flew by said:

Set a breakpoint at the line where it exits, wait for it to crash, and then examine the stack.


I was asking Quixadhal what he thought I should do. Your message showed up fine on my screen. Did you not see it on yours or something that you had to double post?
       
Post is unread #11 Apr 28, 2008, 12:11 pm

David Haley
Sorcerer
GroupMembers
Posts903
JoinedJan 29, 2007

I dunno, maybe I'm pointing out the fact that you have been given the answer many times already and for some reason you seem unwilling to go forward with it or ask for clarifications on how to set breakpoints etc. Instead, you repeat your question over and over again, starting new threads too. :smile:

I guess that I'm not sure how to help you anymore. You don't seem willing to do what you need to do to solve your problem. :shrug:
       
Post is unread #12 Apr 28, 2008, 2:31 pm

Quixadhal
Conjurer
GroupMembers
Posts398
JoinedMar 8, 2005

Banner said:

how switching to hotboot suddenly creates a new descriptor error that involves emergency copyover.


And that is indeed the question.

As much as we like to think that snippets of code are nice plug-and-play blocks you can put in and pull out at will, the DikuMUD codebase was never designed to be that modular, and the C programming language does nothing to make it any simpler to be modular. If you don't understand what the code does, it's twice as hard to figure out where it's going wrong.

My suggestion is to read through all the code that's involved in accepting connections, in closing them, and in performing the actual copyover, both setting it up before the exec, and recovering afterwards. You should also know exactly how signal handlers interact with each of these cases. Man pages are your friends.

If you understand the flow of the code, you'll know exactly where to set breakpoints and what values you are interested in observing when those breakpoints are hit. Otherwise, you're just throwing darts at a dartboard with the lights turned off.
       
Post is unread #13 Apr 28, 2008, 2:34 pm

David Haley
Sorcerer
GroupMembers
Posts903
JoinedJan 29, 2007

The thing is that it would be extremely easy to get a start here if only we knew which descriptors were being put into the FD_SETs ... and the best way to do that is to have breakpoints at the point where the error message comes up, and examine which descriptors are being polled ... Otherwise, this is all just an exercise in futility. Needle and haystack, darts and darkness, whichever metaphor you like they all apply. :smile:
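
If breakpoints are awkward to use, a rough alternative is to test each descriptor just before it goes into an fd_set. The sketch below relies on the fact that fcntl( fd, F_GETFD ) fails with EBADF for a descriptor that is not open. The function names are hypothetical, and how you collect the candidate descriptors (the listening socket plus every connected player) depends on your codebase.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>

/* Returns 1 if fd refers to an open descriptor, 0 if the kernel rejects it. */
int fd_is_open( int fd )
{
   return fcntl( fd, F_GETFD ) != -1 || errno != EBADF;
}

/* Hypothetical diagnostic: scan n candidate descriptors, copy the stale ones
 * into bad[] (which must have room for n entries) and return how many there are. */
size_t find_stale_fds( const int *fds, size_t n, int *bad )
{
   assert( fds != NULL && bad != NULL );

   size_t count = 0;
   for( size_t i = 0; i < n; ++i )
      if( !fd_is_open( fds[i] ) )
         bad[count++] = fds[i];
   return count;
}
```

Logging the result right before the select() call would show which descriptor triggers "Bad file descriptor", and whether it is the listening socket or a player connection.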
       
Post is unread #14 Apr 28, 2008, 6:16 pm

Banner
Magician
GroupMembers
Posts169
JoinedNov 29, 2005

Sorry, DavidHaley. I looked at the guide on GDB that was linked to and it tells how to set breakpoints, so I didn't need to ask. I was given the answer once, not "many times". And I was clarifying by asking for a confirmation. I haven't gotten around to trying it yet between school and work, but when I do, I'll be sure to call you up and we can calm these waters. I'm unsure as to why you're so hostile anyway.

Quixadhal, thanks for the nice-mannered explanation. I'll try what's been suggested and report back. Much thanks.
       
Post is unread #15 Apr 28, 2008, 6:51 pm   Last edited Apr 28, 2008, 6:51 pm by David Haley

David Haley
Sorcerer
GroupMembers
Posts903
JoinedJan 29, 2007

Actually, I did suggest that you use gdb several times, and you were not very responsive. I suggested it one last time above, and you said nothing at all, instead asking another person, yet again, what you should do to solve your problem. I don't understand why you ask questions, are given suggestions, and then you neither follow up on those suggestions nor explain why they don't work for you, just to come back and ask questions again. If you didn't have time to try it, you could have just said so, but completely ignoring it was rather rude on your part given that I am spending time to help you.

Anyhow, it is pointless to continue until we have that extra information, so I guess all we can do now is wait for that. :smile:


EDIT: while I agree that I am frustrated, I don't see any hostility on my part. If you felt that repeating myself was such a hostile act, I apologize, but then again to me it looked like you were completely ignoring the suggestion, or perhaps simply missed it.
       
Post is unread #16 Apr 28, 2008, 8:04 pm

Samson
Black Hand
GroupAdministrators
Posts3,639
JoinedJan 1, 2002

Banner said:


Samson said:

Heh. I'd tell you my solution but you explicitly said you didn't want to hear it already :)

Welcome to the hell of running dangerous code.


What exactly does that even mean? I never said anywhere I didn't want to hear anything. And if your comment is referring to emergency copyover, that's not the case. It works fine since I fixed it for hotboot, and it worked fine on copyover. The error I'm receiving has nothing to do with it, and if it does, then tell me so I can fix it instead of giving me cryptic messages. I don't see how switching to hotboot suddenly creates a new descriptor error that involves emergency copyover. :)


Well you said it with the emergency copyover thing. I suppose I just saw this thread and thought it was still the same topic. Personally I can't see how you'd hold that in isolation from switching "copyover" to "hotboot" ( the same thing! ) since that emergency handler is still going to invoke exactly the same process. You already know where I stand on such hackery in code and that I've been very active in discouraging people from using such things. But you said you didn't want that type of answer on the subject. Thing is, what you're running into could well be related, and, well, I can't help the feeling that the dangers involved have been ignored.
       
Post is unread #17 Apr 29, 2008, 3:50 pm

Keberus
Conjurer
GroupFUSS Project Team
Posts341
JoinedJun 4, 2005

Just kinda figured I would go ahead and add my 1/2 cent about this thing. We used to use the emergency copyover code in our mud, and yes, for most things it worked out okay: it just copied over through the segmentation violation, or seg fault. But if the crash happens while you are saving characters, objects, rooms, or any file for that matter, and you then load from those files, it's possible that the interrupted save causes a continuous crash/emergency copyover cycle, basically an infinite loop. The other problem was when the error was in a file that's loaded on startup: since not everything was loaded but we still tried saving things, we ran into many, many issues. These things happened to us at least a dozen times in the year and a half that we used the emergency copyover code. If this doesn't deter you, then I suggest you put in a few "failsafes", one being a check to make sure everything has been loaded before bothering to "resave" everything before the emergency copyover is called.
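
A rough sketch of that failsafe, assuming the emergency handler can consult two flags before it saves anything. All names are hypothetical. The second flag is an extra guard against a crash inside the emergency save itself; it only helps within one process, because both flags reset after the exec.

```c
#include <signal.h>

/* Hypothetical flags. sig_atomic_t is the integer type the C standard
 * guarantees can be read and written safely from a signal handler. */
static volatile sig_atomic_t boot_complete = 0;
static volatile sig_atomic_t emergency_in_progress = 0;

/* Call once, after boot_db() and any hotboot recovery have finished. */
void mark_boot_complete( void )
{
   boot_complete = 1;
}

/* Returns 1 only if it is reasonable to resave and copyover:
 * the world is fully loaded and no emergency save is already running.
 * Returns 1 at most once per process. */
int may_emergency_save( void )
{
   if( !boot_complete )
      return 0;          /* crashed during startup: data is incomplete */
   if( emergency_in_progress )
      return 0;          /* crashed while saving: do not try again */
   emergency_in_progress = 1;
   return 1;
}
```

A SIGSEGV handler would call `may_emergency_save()` first and simply let the process die when it returns 0, so the crash can be examined instead of looping.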
       
Post is unread #18 Apr 29, 2008, 4:11 pm

David Haley
Sorcerer
GroupMembers
Posts903
JoinedJan 29, 2007

Well, basically, if you are crashing because objects get corrupted, then saving those and thinking you can safely and fully restore from them is essentially suicide as far as stability is concerned. At best, you will lose just the portion that got corrupted; at worst you can lose huge chunks of data if you start writing to files and introduce more and more inconsistencies due to the corrupted data.

Another way to help get around this is to never write to the file until you have the entire string ready to be written. This requires writing to an intermediate destination (i.e. a string in memory, or a temp. file on disk) and then overwriting the actual file only at the very end.
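
A minimal sketch of the temp-file variant, assuming the whole save has already been built as a string in memory. `save_atomically` and the ".tmp" suffix are hypothetical. On POSIX systems rename() replaces the target in one step, so a crash leaves either the old file or the new one, never a half-written file.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write data to "<path>.tmp", then rename it over path.
 * Returns 0 on success and -1 on any failure, leaving the old file untouched. */
int save_atomically( const char *path, const char *data )
{
   char tmp[512];

   assert( path != NULL && data != NULL );

   int n = snprintf( tmp, sizeof tmp, "%s.tmp", path );
   if( n < 0 || ( size_t ) n >= sizeof tmp )
      return -1;                      /* path too long for the buffer */

   FILE *fp = fopen( tmp, "w" );
   if( !fp )
      return -1;

   size_t len = strlen( data );
   int ok = fwrite( data, 1, len, fp ) == len;
   if( fclose( fp ) != 0 )            /* a failed flush counts as a failed save */
      ok = 0;

   if( !ok || rename( tmp, path ) != 0 )
   {
      remove( tmp );                  /* clean up the partial temp file */
      return -1;
   }
   return 0;
}
```

The real save routines write field by field through a FILE pointer, so in practice the same idea would mean opening the temp path there and doing the rename only after the last field has been written.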

But yes, emergency copyover is very dangerous. But like Samson pointed out it has already been said that Banner is totally ok with that and doesn't want to reconsider it, because for his purposes it is better than the alternative.
       