Recovering from EX disk failure (db> prompt)

Recovering from disk failure
One of the common reasons a switch drops out of the cluster is disk corruption. If the cluster can't see the switch and vice versa, the first thing I would check is the status of the disks. Another giveaway is the LOADING JUNOS message on the switch's LCD display.

The EX4200 switches have 2 internal disks (called internal media). Strictly speaking these are two slices of the same internal flash device, da0:

1. da0s1 (called slice 1)
2. da0s2 (called slice 2)

A USB disk can be plugged into the back of the switch and is called external media.

Both internal disks hold a copy of the OS and configuration files, and the switch can use either of them to get the files it needs. When one disk is corrupt, the switch will automatically use the other.

The following procedure is to be used to fix a corrupt disk. The general outline is as follows:

• Identify the corrupt internal disk.
• Copy all files from the good internal disk to a USB disk.
• Boot from the USB disk.
• Fix the corrupt disk.
• Boot into the fixed disk.
• Reboot the whole stack. 
• Confirm fix.
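
For quick reference, here is a rough sketch of how the outline maps onto the CLI commands used in this post, written as a small Python table. The commands are copied from the steps in this post; the member number (0) and slice number (2) come from the worked example and will differ on your stack. The names RECOVERY_STEPS and format_checklist are my own, purely for illustration.

```python
# Outline step -> CLI commands, copied from the steps in this post.
# Member 0 and slice 2 match the worked example; yours may differ.
RECOVERY_STEPS = [
    ("Identify the corrupt internal disk", [
        "show system snapshot all-members slice 1 media internal",
        "show system snapshot all-members slice 2 media internal",
    ]),
    ("Copy all files from the good internal disk to a USB disk", [
        "request session member 0",
        "request system snapshot local partition media external",
    ]),
    ("Boot from the USB disk", [
        "request system reboot local media external",
    ]),
    ("Fix the corrupt disk", [
        "request system snapshot local partition media internal slice 2",
    ]),
    ("Boot into the fixed disk", [
        "request system reboot local slice 2 media internal",
    ]),
    ("Reboot the whole stack", [
        "request system reboot",
    ]),
    ("Confirm fix", [
        "show virtual-chassis",
        "show system snapshot all-members slice 1 media internal",
        "show system snapshot all-members slice 2 media internal",
    ]),
]


def format_checklist(steps):
    """Return the steps as a numbered plain-text checklist."""
    lines = []
    for number, (title, commands) in enumerate(steps, start=1):
        lines.append(f"{number}. {title}")
        lines.extend(f"     {command}" for command in commands)
    return "\n".join(lines)


print(format_checklist(RECOVERY_STEPS))
```

Printing the checklist before you start gives you something to tick off at the console.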

Before you start, make sure you connect a console cable to the back of the malfunctioning switch. All work will be done over this console connection in case you lose network connectivity.


Procedure 

1. Identifying the corrupted disk

Console onto the stack and issue the following commands in user mode:

show system snapshot all-members slice 1 media internal
show system snapshot all-members slice 2 media internal


This will display the contents of disks da0s1 and da0s2 on all stack members.

A good disk will look like this:

fpc0:
--------------------------------------------------------------------------
Information for snapshot on internal (da0s1)
Creation date: Jan 8 08:43:51 2010
JUNOS version on snapshot:
jbase : 10.0S1.1
jcrypto-ex: 10.0S1.1
jdocs-ex: 10.0S1.1
jkernel-ex: 10.0S1.1
jroute-ex: 10.0S1.1
jswitch-ex: 10.0S1.1
jweb-ex: 10.0S1.1
jpfe-ex42x: 10.0S1.1



For reference, fpc0 is the Flexible PIC Concentrator for Switch 0 in the stack. fpc1 is Switch 1 and so on...

A corrupt disk will look like this:

fpc0:
--------------------------------------------------------------------------
error: cannot mount /dev/da0s2a


In this case disk da0s2 is corrupt.

Now that we have identified disk 2 in switch 0 in the stack as corrupt we move on to the next step.
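
On a big stack it is easy to miss one error line in a long scroll of output. If you capture the console session to a file, something like the following Python sketch could pick out the members reporting errors. It assumes the output looks like the samples above (an "fpcN:" header followed by either snapshot information or an "error:" line); the function name find_corrupt_members is hypothetical and not part of any Juniper tool.

```python
import re


def find_corrupt_members(output):
    """Map each fpc name to the list of error lines reported under it."""
    problems = {}
    current = None
    for raw in output.splitlines():
        line = raw.strip()
        header = re.fullmatch(r"(fpc\d+):", line)
        if header:
            current = header.group(1)
        elif line.startswith("error:") and current is not None:
            problems.setdefault(current, []).append(line)
    return problems


# Sample text modelled on the outputs shown in this post.
sample = """\
fpc0:
--------------------------------------------------------------------------
error: cannot mount /dev/da0s2a
fpc1:
--------------------------------------------------------------------------
Information for snapshot on internal (da0s2)
"""

print(find_corrupt_members(sample))
# {'fpc0': ['error: cannot mount /dev/da0s2a']}
```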

2. Copy all files from the good internal disk to a USB disk

Log on to switch 0 using the following command:

request session member 0

Let's look at disk 2 to confirm you are on the correct switch, using the following command:

show system snapshot local slice 2 media internal

The output should be:

error: cannot mount /dev/da0s2a

Insert a 2GB USB flash disk into the USB slot at the back of switch 0. Take note of the size: we tried flash disks bigger than 2GB and USB hard disks, and none of them worked. The disk also seemed to need a UFS or FAT32 format.

Copy disk 1 files to the USB disk using the following command:

request system snapshot local partition media external

Expect the following output if the operation is successful:

fpc0:
--------------------------------------------------------------------------
Clearing current label...
Partitioning external media (da1) ...
Verifying compatibility of destination media partitions...
Running newfs (720MB) on external media / partition (da1s1a)...
Running newfs (217MB) on external media /config partition (da1s1e)...
Running newfs (480MB) on external media /var partition (da1s1f)...
Copying '/dev/da0s1a' to '/dev/da1s1a' .. (this may take a few minutes)
Copying '/dev/da0s1e' to '/dev/da1s1e' .. (this may take a few minutes)
Copying '/dev/da0s1f' to '/dev/da1s1f' .. (this may take a few minutes)
The following filesystems were archived: / /config /var
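
If you log the console session, a short Python sketch like this could summarise what the snapshot did. Treating the final "The following filesystems were archived" line as the success marker is my assumption, based only on the sample output above; summarize_snapshot is a made-up helper name.

```python
import re


def summarize_snapshot(output):
    """Pull the copy operations and the archived filesystems out of snapshot output."""
    copies = re.findall(r"Copying '([^']+)' to '([^']+)'", output)
    archived = re.search(
        r"^[ \t]*The following filesystems were archived:(.*)$", output, re.MULTILINE
    )
    errors = re.findall(r"^[ \t]*error:.*$", output, re.MULTILINE)
    return {
        "copies": copies,
        "archived": archived.group(1).split() if archived else [],
        "errors": errors,
        # Assumption: success means the archived line is present and no error lines.
        "ok": archived is not None and not errors,
    }


sample = """\
Copying '/dev/da0s1a' to '/dev/da1s1a' .. (this may take a few minutes)
Copying '/dev/da0s1e' to '/dev/da1s1e' .. (this may take a few minutes)
Copying '/dev/da0s1f' to '/dev/da1s1f' .. (this may take a few minutes)
The following filesystems were archived: / /config /var
"""

summary = summarize_snapshot(sample)
print(summary["ok"], summary["archived"])
```

The same check should work for the snapshot back to the internal disk in step 4, since its output has the same shape.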


3. Boot from the USB disk

Issue the following command:

request system reboot local media external 

You'll get this prompt...type 'yes'

Reboot the system ? [yes,no] (no) yes

4. Fix the corrupt disk

Copy files from the USB disk to disk 2 using the following command:

request system snapshot local partition media internal slice 2 

Expect the following output if the operation is successful:

fpc0:
--------------------------------------------------------------------------
Clearing current label...
Partitioning internal media (da0) ...
Verifying compatibility of destination media partitions...
Running newfs (187MB) on internal media / partition (da0s2a)...
Running newfs (56MB) on internal media /config partition (da0s2e)...
Running newfs (124MB) on internal media /var partition (da0s2f)...
Copying '/dev/da1s1a' to '/dev/da0s2a' .. (this may take a few minutes)
Copying '/dev/da1s1e' to '/dev/da0s2e' .. (this may take a few minutes)
Copying '/dev/da1s1f' to '/dev/da0s2f' .. (this may take a few minutes)
The following filesystems were archived: / /config /var


5. Boot into the fixed disk

Issue the following command:

request system reboot local slice 2 media internal 

You'll get this prompt...type 'yes'

Reboot the system ? [yes,no] (no) yes

The switch will boot into loader mode. Issue the following command at the 'loader>' prompt:

loader> reboot

Once the switch has booted, log in to it.

Have a look around the switch and try a few show commands to satisfy yourself that everything is working fine. A good command to try is 'show virtual-chassis'.

Expect to see all the switches in your stack, as in the sample output below. What you want to see is status Prsnt on all the cluster members.

0 (FPC 0) Prsnt B00000000000 ex4200-48p 128 Linecard 2 vcp-0
1 (FPC 1) Prsnt B00000000000 ex4200-48p 254 Master* 0 vcp-0
2 (FPC 2) Prsnt B00000000000 ex4200-48p 128 Linecard 4 vcp-0
3 (FPC 3) Prsnt B00000000000 ex4200-48p 254 Backup 1 vcp-0
4 (FPC 4) Prsnt B00000000000 ex4200-48p 128 Linecard 3 vcp-0
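
If you would rather not eyeball the status column, here is a minimal Python sketch that flags any member whose status is not Prsnt. It only assumes each member row starts with the member ID, "(FPC n)" and the status, as in the sample above. The function name is hypothetical.

```python
import re

# Member ID, "(FPC n)", then the status column; later columns are ignored.
ROW = re.compile(r"^\s*(\d+)\s+\(FPC\s+\d+\)\s+(\S+)")


def members_not_present(output):
    """Return (member_id, status) for every row whose status is not Prsnt."""
    missing = []
    for line in output.splitlines():
        match = ROW.match(line)
        if match and match.group(2) != "Prsnt":
            missing.append((int(match.group(1)), match.group(2)))
    return missing


sample = """\
0 (FPC 0) Prsnt B00000000000 ex4200-48p 128 Linecard 2 vcp-0
1 (FPC 1) Prsnt B00000000000 ex4200-48p 254 Master* 0 vcp-0
2 (FPC 2) Prsnt B00000000000 ex4200-48p 128 Linecard 4 vcp-0
3 (FPC 3) Prsnt B00000000000 ex4200-48p 254 Backup 1 vcp-0
4 (FPC 4) Prsnt B00000000000 ex4200-48p 128 Linecard 3 vcp-0
"""

print(members_not_present(sample))
# []
```

An empty list means every member row it recognised is Prsnt.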


6. Reboot the whole stack

Issue the following command:

request system reboot 

At the reboot prompt type 'yes'

Reboot the system ? [yes,no] (no) yes

7. Confirm fix


Log in to the stack and look again at the cluster status. Confirm that the status of every member is Prsnt.

0 (FPC 0) Prsnt B00000000000 ex4200-48p 128 Linecard 2 vcp-0
1 (FPC 1) Prsnt B00000000000 ex4200-48p 254 Master* 0 vcp-0
2 (FPC 2) Prsnt B00000000000 ex4200-48p 128 Linecard 4 vcp-0
3 (FPC 3) Prsnt B00000000000 ex4200-48p 254 Backup 1 vcp-0
4 (FPC 4) Prsnt B00000000000 ex4200-48p 128 Linecard 3 vcp-0  
 
Issue the following commands:

show system snapshot all-members slice 1 media internal
show system snapshot all-members slice 2 media internal


This will display the contents of disks da0s1 and da0s2 on all stack members.


Complete package listings and the absence of error messages confirm success. Disks 1 and 2 on every member should look like this:


fpc0:
--------------------------------------------------------------------------
Information for snapshot on internal (da0s1)
Creation date: Jan 8 08:43:51 2010
JUNOS version on snapshot:
jbase : 10.0S1.1
jcrypto-ex: 10.0S1.1
jdocs-ex: 10.0S1.1
jkernel-ex: 10.0S1.1
jroute-ex: 10.0S1.1
jswitch-ex: 10.0S1.1
jweb-ex: 10.0S1.1
jpfe-ex42x: 10.0S1.1
fpc0:
--------------------------------------------------------------------------
Information for snapshot on internal (da0s2)
Creation date: Jan 11 06:40:09 2010
JUNOS version on snapshot:
jbase : 10.0S1.1
jcrypto-ex: 10.0S1.1
jdocs-ex: 10.0S1.1
jkernel-ex: 10.0S1.1
jroute-ex: 10.0S1.1
jswitch-ex: 10.0S1.1
jweb-ex: 10.0S1.1
jpfe-ex42x: 10.0S1.1
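
As an extra check you can confirm that both slices list the same package versions. The following Python sketch parses captured output shaped like the sample above and compares the package lists. The function names are my own, and the parsing rules are assumptions based only on the output shown in this post.

```python
import re


def versions_by_slice(output):
    """Map (member, slice) to a {package: version} dict."""
    result = {}
    member = None
    current = None
    for raw in output.splitlines():
        line = raw.strip()
        header = re.fullmatch(r"(fpc\d+):", line)
        if header:
            member = header.group(1)
            current = None
            continue
        info = re.fullmatch(r"Information for snapshot on \w+ \((\w+)\)", line)
        if info:
            current = result.setdefault((member, info.group(1)), {})
            continue
        # Package lines look like "jbase : 10.0S1.1" or "jweb-ex: 10.0S1.1".
        package = re.fullmatch(r"([\w-]+)\s*:\s*(\S+)", line)
        if package and current is not None:
            current[package.group(1)] = package.group(2)
    return result


def all_slices_match(output):
    """True when at least two slices were parsed and all list identical packages."""
    found = list(versions_by_slice(output).values())
    return len(found) >= 2 and bool(found[0]) and all(v == found[0] for v in found)


sample = """\
fpc0:
--------------------------------------------------------------------------
Information for snapshot on internal (da0s1)
Creation date: Jan 8 08:43:51 2010
JUNOS version on snapshot:
jbase : 10.0S1.1
jkernel-ex: 10.0S1.1
fpc0:
--------------------------------------------------------------------------
Information for snapshot on internal (da0s2)
Creation date: Jan 11 06:40:09 2010
JUNOS version on snapshot:
jbase : 10.0S1.1
jkernel-ex: 10.0S1.1
"""

print(all_slices_match(sample))
# True
```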


Done. Thanks for reading
Recovering a Juniper J4350 should be easy because the manual says so. Just remember to take your screwdriver ;-)

A few weeks ago my lab router fell off the network. The console showed me repeating hash (#) symbols followed by a HEX dump of a few characters. I figured this looked bad... don’t know what happened. Since I didn’t need it right away I’ve left it until now, so it’s time to have a go at fixing it.

  • You’ll have to remove the chassis from the rack (4 cage screws).
  • Put the router on the bench and take off the rack ears from both sides of the chassis.
  • Now go to the back of the chassis and you’ll see three black screws on the top. You need to unscrew these.
  • Along each of the sides of the chassis (again toward the top of the case) you’ll see three more screws. Unscrew these on both sides.
  • Now the case lid should come off.

Screen shot 2011-05-08 at 20.40.08

With the router lid off I can now see there are 4 x PC3200 DIMMs toward the left-hand side, and the compact flash (256MB) sits snugly against the motherboard. My task here is to reformat the flash with the ‘install’ image from the Juniper software download page. I’ve chosen 9.3 because I felt like it, but notice that there are three flavours available, for 256, 512 and 1024MB flash cards. Choose the right one for your flash size; this one shows the 512MB version.

Screen shot 2011-05-08 at 20.45.16

I’ve gotten the image and extracted the flash from the 4350 (man, that was a pain too, because the fan buffer was in the way). Now, I don’t have a flash/PCMCIA slot in my iMac, but I do have a printer connected to it with a compact flash reader, so I figure I’m going to give it a try with that. With the flash plugged in, I load the Terminal, change to root (sudo -i) and run ‘dmesg’ to see any kernel messages. Here is what I see:

Screen shot 2011-05-08 at 20.36.25

Check it out, maybe my luck is changing. /dev/disk2s1 must be my Juniper flash card. Great, now I can format it. First things first: I need to uncompress the gzip file I just downloaded from Juniper. The original file was junos-jsr-9.3R4.4-export-cf256.gz, so I run ‘gzip -d junos-jsr-9.3R4.4-export-cf256.gz’ to extract it. Then I run the old faithful ‘dd’ command, which is standard on *nix platforms, to copy the contents of the uncompressed image onto the flash.

Screen shot 2011-05-08 at 20.36.01
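
For anyone without gzip and dd to hand, the same two steps (decompress, then copy the raw bytes) can be sketched in Python. This is not what I ran, just an illustration. The function name write_image is made up, and the target path is whatever device node your card shows up as. Pointing it at the wrong device will overwrite that disk, and raw devices can be picky about write sizes, so treat it as a sketch only.

```python
import gzip


def write_image(gz_path, target_path, chunk_size=1024 * 1024):
    """Decompress gz_path and stream the raw bytes to target_path.

    Returns the number of bytes written. target_path can be a plain file or,
    run as root, a device node (an assumption: check yours carefully first).
    """
    written = 0
    with gzip.open(gz_path, "rb") as source, open(target_path, "wb") as target:
        while True:
            chunk = source.read(chunk_size)
            if not chunk:
                break
            target.write(chunk)
            written += len(chunk)
    return written
```

Trying it against an ordinary file first, for example write_image("junos-jsr-9.3R4.4-export-cf256.gz", "test.img"), is a cheap way to check that the download decompresses cleanly before touching the card.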

....I wait...and wait some more...then

Screen shot 2011-05-08 at 20.52.07

Awesome, looks like the data is on now. So I put the CF card back into the chassis and power on (there is no way I’m putting all those screws back in just yet)... and... it didn’t work ;-( All I see is #s and the fan keeps spinning up and down.

So I tried a USB stick. I took out the CF card because that is booted first, then plugged the USB flash into the front of the router and powered on.

© 2011 defaultrouteuk.com

Cisco, IOS, CCNA, CCNP, CCIE are trademarks of Cisco Systems Inc.
JunOS, JNCIA, JNCIP, JNCIE are registered trademarks of Juniper Networks Inc.