This problem is not yet resolved; a quick reference on it will probably be written once it is.
ERROR MESSAGE IN DIRECT CONSOLE
warning: many lost ticks.
Your time source seems to be instable or some driver is hogging interupts
rip __do_softirq+0x4d/0xd0
Falling back to HPET
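To check whether a machine has logged these warnings, the kernel log can be searched for the two message texts quoted in this post. This is only a minimal sketch in Python. It assumes the log text has already been read into a string (for example from the output of `dmesg`), and the function name `count_lost_tick_warnings` is made up for this illustration.

```python
import re

# Message fragments quoted in this post; other kernel versions may word them differently.
LOST_TICK_PATTERNS = [
    r"many lost ticks",
    r"Losing some ticks",
]

def count_lost_tick_warnings(log_text):
    """Return the number of log lines that contain a lost-tick warning."""
    pattern = re.compile("|".join(LOST_TICK_PATTERNS))
    return sum(1 for line in log_text.splitlines() if pattern.search(line))

# Sample taken from the console message shown above.
sample = "\n".join([
    "warning: many lost ticks.",
    "Your time source seems to be instable or some driver is hogging interupts",
    "rip __do_softirq+0x4d/0xd0",
    "Falling back to HPET",
])
print(count_lost_tick_warnings(sample))  # prints 1
```

A count that keeps growing between two checks would suggest the problem is still active rather than a one-time event at boot.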
Conditions under which the error occurs
Environment
1. TYAN TEMPEST I5000PX (S5380) - Xeon Woodcrest 2.66GHz Dualcore / DDR2 1GB FB 5300 ECC
* The error message above is from this system (case 1).
2. NVIDIA NFORCE 4 CK804 - AMD Athlon 64 X2 Dual Core 3800+
* This system shows a different message: "Losing some ticks... checking if CPU frequency changed."
Added note V1:
I have confirmed that the error splits into two different cases, one for each of the two systems listed under Environment.
Operating conditions
1. HTTPD
The error appears when the server load average reaches about 15 or more, which already means the software is using an extremely large amount of resources. Overall server performance then drops by roughly 40%.
2. MYSQL
The error appears when the server load average reaches about 40 or more. Once the messages have been shown, the load immediately climbs above 900 and the server can no longer be operated at all; MySQL and other services cannot even be shut down.
For comparison, the same data input/output on another server with a similar CPU stays at a load average of 5.0 or less. On the affected kernel the load easily passes 15 and then goes beyond 50, which suggests that resources are not being released and reused.
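The load figures above can be watched by reading `/proc/loadavg`, whose first three fields are the 1, 5 and 15 minute load averages. The following is a rough sketch in Python, not a monitoring tool. The thresholds are simply the values observed in this post, the sample line is made up, and the names `parse_loadavg` and `exceeded` are hypothetical helpers.

```python
def parse_loadavg(text):
    """Parse the first three fields of a /proc/loadavg line (1, 5, 15 minute averages)."""
    fields = text.split()
    return tuple(float(value) for value in fields[:3])

# Thresholds taken from the observations in this post; they are not general limits.
HTTPD_THRESHOLD = 15.0
MYSQL_THRESHOLD = 40.0

def exceeded(load_1min, threshold):
    """True when the 1-minute load average has reached or passed the threshold."""
    return load_1min >= threshold

# Made-up sample line in the /proc/loadavg format.
sample = "41.07 22.30 9.85 12/310 4521"
one_min, five_min, fifteen_min = parse_loadavg(sample)
print(exceeded(one_min, HTTPD_THRESHOLD), exceeded(one_min, MYSQL_THRESHOLD))  # prints True True
```

On a live system the sample string would be replaced by the contents of `/proc/loadavg`, for example `open("/proc/loadavg").read()`.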
Tried kernels
2.6.9-42.0.10.plus.c4smp
2.6.9-42.0.10.ELsmp
2.6.20.6 - custom build
I tried changing the values below. Every setting failed.
on BIOS -
ACPI APIC / enable / disable
Memory Remap Features / enable / disable
MAX CPUID VALUE LIMIT / enable / disable
OS handle CPU FREQUENCY / enable / disable
Power Management timer / enable / disable
on Kernel -
clock=pmtmr
clock=pit
noapic
all-generic-ide with bios ahci handling
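When experimenting with boot options like these, it helps to confirm which ones the running kernel was actually started with. The kernel exposes its command line in `/proc/cmdline`. Below is a small Python sketch that picks out the options mentioned in this post from such a line. The option list is only illustrative, and the function name `find_suspect_options` is an assumption made for this example.

```python
# Option names mentioned in this post; the list is illustrative, not exhaustive.
SUSPECT_OPTIONS = (
    "clock",
    "noapic",
    "all-generic-ide",
    "noirqbalance",
    "acpi_irq_balance",
)

def find_suspect_options(cmdline):
    """Return the tokens of a kernel command line whose name is in SUSPECT_OPTIONS."""
    found = []
    for token in cmdline.split():
        # "clock=pmtmr" has the name "clock"; "noapic" has no value part.
        name = token.split("=", 1)[0]
        if name in SUSPECT_OPTIONS:
            found.append(token)
    return found

# Made-up command line in the style of the boot logs in this post.
sample = "ro root=LABEL=/ console=tty0 clock=pmtmr noapic"
print(find_suspect_options(sample))  # prints ['clock=pmtmr', 'noapic']
```

An empty result means none of these options is active, which is the state the boot loader configuration should be in when testing BIOS settings on their own.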
Suspected cause
Heavy I/O may be breaking communication between the CPU and the APIC clock counter, or some specific ACPI-related clock counter. It may also be a detection failure in a kernel driver.
This is not certain, but my guess is that the excessive I/O leaves the CPU unable to keep up with that counter.
When will this ever get solved -_-;;
Solution
Primary Case.
Solved.
Update the kernel to 2.4.10.c4smp (centos-plus based)
Update the BIOS to 1.04 (if you are using the Tyan S5380)
on bios/
-> max CPUID value limit / on, set to 3
-> multimedia timer / on
-> apic/acpi/local acpi / on
-> os handle cpu freq / disable
(if the problem does not disappear, turn off Hyper-Threading Technology)
Clear the GRUB loader configuration (no noirqbalance, acpi_irq_balance, clock, etc.)
on boot/
rc.local -> add these lines
ethtool -s eth0 speed 1000 duplex full autoneg off wol d
ethtool -A eth0 autoneg off rx off tx off
ethtool -K eth0 tso off
I found that the problem comes from the e1000 driver, but it turned out to be more than one problem overlapping.
(In fact, once the e1000 settings above were turned off, the problem showed up on another IRQ, which is why I ended up changing most of the BIOS settings.)
If the messages still appear after changing the e1000 settings, turn off Hyper-Threading Technology in the BIOS; according to my traces, the problem should then go away.
Secondary Case.
2. NVIDIA NFORCE 4 CK804 - AMD Athlon 64 X2 Dual Core 3800+
This system displays the message "Losing some ticks... checking if CPU frequency changed."
Set ACPI and APIC appropriately, and take care to set the S-ATA mode to a compatible setting before booting.
The kernel often cannot detect the NF4 CK804 SATA controller, so if you use this chipset you are probably running with the "all-generic-ide" option. That option is the real source of the problem: it causes APIC clock loss. Remove "all-generic-ide" from the kernel loader configuration and adjust the CMOS settings instead.
In addition, by the same logic, boards such as the Intel 965RY (DG and SS series) cannot run stably under Linux.
If you see an error such as "APIC error on CPU0: 00(60)" on the console, the settings are wrong and you should change several CMOS values. I cannot tell you exactly which values will work on your system.
Some people say this message is harmless and that the APIC error does not affect system performance, but I actually observed a cumulative performance change. The APIC error message definitely affects system performance:
APIC error on CPU0: 00(60) -> clock loss -> "Losing some ticks... checking if CPU frequency changed." -> performance drop
Who was it that claimed otherwise? Those Red Hat fools..
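The chain described here (APIC error first, lost ticks afterwards) can be looked for in a saved kernel log. The Python sketch below only checks the order in which the two messages appear. Order in a log does not prove that one caused the other, so treat a positive result as a hint at most. The function name `apic_error_precedes_lost_ticks` is hypothetical.

```python
def apic_error_precedes_lost_ticks(log_text):
    """True if an APIC error line appears earlier in the log than a lost-tick line."""
    seen_apic_error = False
    for line in log_text.splitlines():
        if "APIC error on CPU" in line:
            seen_apic_error = True
        elif "Losing some ticks" in line and seen_apic_error:
            return True
    return False

# Sample built from the two messages quoted in this post.
sample = "\n".join([
    "APIC error on CPU0: 00(60)",
    "Losing some ticks... checking if CPU frequency changed.",
])
print(apic_error_precedes_lost_ticks(sample))  # prints True
```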
Messages from a completely stable system
Below is the dmesg output from my G965 and NF4 systems.
Case G965.
ACPI: RSDP (v002 ACPIAM ) @ 0x00000000000fa950
ACPI: XSDT (v001 MSTEST OEMXSDT 0x01000718 MSFT 0x00000097) @ 0x000000007e790100
ACPI: FADT (v003 MSTEST OEMFACP 0x01000718 MSFT 0x00000097) @ 0x000000007e790290
ACPI: MADT (v001 MSTEST OEMAPIC 0x01000718 MSFT 0x00000097) @ 0x000000007e790390
ACPI: MCFG (v001 MSTEST OEMMCFG 0x01000718 MSFT 0x00000097) @ 0x000000007e790400
ACPI: SLIC (v001 MSTEST TESTONLY 0x01000718 MSFT 0x00000097) @ 0x000000007e790440
ACPI: OEMB (v001 MSTEST AMI_OEM 0x01000718 MSFT 0x00000097) @ 0x000000007e79e040
ACPI: HPET (v001 MSTEST OEMHPET 0x01000718 MSFT 0x00000097) @ 0x000000007e7981d0
ACPI: GSCI (v001 MSTEST GMCHSCI 0x01000718 MSFT 0x00000097) @ 0x000000007e79e0c0
ACPI: DSDT (v001 A0518 A0518000 0x00000000 INTL 0x20060113) @ 0x0000000000000000
No NUMA configuration found
Faking a node at 0000000000000000-000000007e790000
Bootmem setup node 0 0000000000000000-000000007e790000
No mptable found.
On node 0 totalpages: 518032
DMA zone: 4096 pages, LIFO batch:1
Normal zone: 513936 pages, LIFO batch:16
HighMem zone: 0 pages, LIFO batch:1
DMI 2.4 present.
ACPI: PM-Timer IO Port: 0x808
ACPI: Local APIC address 0xfee00000
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
Processor #0 6:15 APIC version 16
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled)
Processor #1 6:15 APIC version 16
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x82] disabled)
ACPI: LAPIC (acpi_id[0x04] lapic_id[0x83] disabled)
Setting APIC routing to flat
ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 2, version 32, address 0xfec00000, GSI 0-23
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
ACPI: IRQ0 used by override.
ACPI: IRQ2 used by override.
ACPI: IRQ9 used by override.
ACPI: HPET id: 0x8086a202 base: 0xfed00000
Using ACPI (MADT) for SMP configuration information
Allocating PCI resources starting at 80000000 (gap: 7e800000:80600000)
Checking aperture...
Built 1 zonelists
Kernel command line: ro root=LABEL=/ console=tty0
Initializing CPU#0
PID hash table entries: 4096 (order: 12, 131072 bytes)
time.c: Using 14.318180 MHz HPET timer.
time.c: Detected 2400.008 MHz processor.
Case NF4.
Bootdata ok (command line is ro root=LABEL=/)
Linux version 2.6.9-42.0.10.EL (mockbuild@builder5.centos.org) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-3)) #1 Tue Feb 27 09:18:57 EST 2007
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009f800 (usable)
BIOS-e820: 000000000009f800 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 00000000d7ff0000 (usable)
BIOS-e820: 00000000d7ff0000 - 00000000d7ff3000 (ACPI NVS)
BIOS-e820: 00000000d7ff3000 - 00000000d8000000 (ACPI data)
BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
ACPI: RSDP (v000 Nvidia ) @ 0x00000000000f9270
ACPI: RSDT (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x00000000) @ 0x00000000d7ff3040
ACPI: FADT (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x00000000) @ 0x00000000d7ff30c0
ACPI: MCFG (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x00000000) @ 0x00000000d7ff9680
ACPI: MADT (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x00000000) @ 0x00000000d7ff95c0
ACPI: DSDT (v001 NVIDIA AWRDACPI 0x00001000 MSFT 0x0100000e) @ 0x0000000000000000
No mptable found.
On node 0 totalpages: 884720
DMA zone: 4096 pages, LIFO batch:1
Normal zone: 880624 pages, LIFO batch:16
HighMem zone: 0 pages, LIFO batch:1
DMI 2.3 present.
Nvidia board detected. Ignoring ACPI timer override.
ACPI: PM-Timer IO Port: 0x4008
ACPI: Local APIC address 0xfee00000
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 15:15 APIC version 16
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] disabled)
ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
Setting APIC routing to flat
ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 2, version 17, address 0xfec00000, GSI 0-23
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
ACPI: INT_SRC_OVR (bus 0 bus_irq 14 global_irq 14 high edge)
ACPI: INT_SRC_OVR (bus 0 bus_irq 15 global_irq 15 high edge)
ACPI: IRQ9 used by override.
ACPI: IRQ14 used by override.
ACPI: IRQ15 used by override.
Using ACPI (MADT) for SMP configuration information
Allocating PCI resources starting at f1000000 (gap: f0000000:ec00000)
Checking aperture...
CPU 0: aperture @ 8822000000 size 32 MB
Aperture from northbridge cpu 0 too small (32 MB)
No AGP bridge found
Built 1 zonelists
Kernel command line: ro root=LABEL=/ console=tty0
Initializing CPU#0
PID hash table entries: 4096 (order: 12, 131072 bytes)
time.c: Using 3.579545 MHz PM timer.
time.c: Detected 2010.321 MHz processor.
time.c: Using PIT/TSC based timekeeping.
........................ skip some parts..............................
NFORCE-CK804: IDE controller at PCI slot 0000:00:06.0
NFORCE-CK804: chipset revision 242
NFORCE-CK804: not 100% native mode: will probe irqs later
NFORCE-CK804: 0000:00:06.0 (rev f2) UDMA133 controller
NFORCE-CK804: neither IDE port enabled (BIOS)
Probing IDE interface ide0...
Probing IDE interface ide1...
Probing IDE interface ide2...
Probing IDE interface ide3...
Probing IDE interface ide4...
Probing IDE interface ide5...
.... skip some message ...
SCSI subsystem initialized
libata version 1.20 loaded.
sata_nv 0000:00:07.0: version 0.8
ACPI: PCI interrupt 0000:00:07.0[A] -> GSI 20 (level, low) -> IRQ 201
PCI: Setting latency timer of device 0000:00:07.0 to 64
ata1: SATA max UDMA/133 cmd 0x9F0 ctl 0xBF2 bmdma 0xCC00 irq 201
ata2: SATA max UDMA/133 cmd 0x970 ctl 0xB72 bmdma 0xCC08 irq 201
ata1: SATA link up 1.5 Gbps (SStatus 113)
ata1: dev 0 cfg 49:2f00 82:74eb 83:7fea 84:4023 85:74e8 86:3c02 87:4023 88:203f
ata1: dev 0 ATA-6, max UDMA/100, 160836480 sectors: LBA48
nv_sata: Primary device added
nv_sata: Primary device removed
nv_sata: Secondary device added
nv_sata: Secondary device removed
ata1: dev 0 configured for UDMA/100
scsi0 : sata_nv
ata2: SATA link down (SStatus 0)
scsi1 : sata_nv
Using cfq io scheduler
Vendor: ATA Model: HDS722580VLSA80 Rev: V32O
Type: Direct-Access ANSI SCSI revision: 05
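The two logs above differ in the timer the kernel picked: HPET on the G965 system and the PM timer on the NF4 system. To compare this across machines, the `time.c: Using ... timer.` lines can be pulled out of a dmesg dump. The Python sketch below assumes the line format matches the logs in this post, and other kernel versions may print it differently. The name `find_timer_sources` is made up for the example.

```python
import re

# Matches lines such as "time.c: Using 14.318180 MHz HPET timer."
TIMER_LINE = re.compile(r"time\.c: Using ([\d.]+) MHz (.+?) timer\.")

def find_timer_sources(dmesg_text):
    """Return (timer name, frequency in MHz) pairs found in the dmesg text."""
    return [(m.group(2), float(m.group(1))) for m in TIMER_LINE.finditer(dmesg_text)]

# Lines copied from the NF4 log above.
sample = "\n".join([
    "time.c: Using 3.579545 MHz PM timer.",
    "time.c: Detected 2010.321 MHz processor.",
    "time.c: Using PIT/TSC based timekeeping.",
])
print(find_timer_sources(sample))  # prints [('PM', 3.579545)]
```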
Posted by LeCieL



