Tuesday, March 28, 2006

SQL: Calculating the geometric mean (GEOMEAN)

I had a user at the last place I worked who needed to calculate the geometric mean of a column in a table. The geometric mean is like a regular mean, but has the practical effect of throwing out outliers (unusual spikes in the data). This is a standard function in Excel, but not in most databases.

The standard method of calculating the geometric mean is by multiplying all of the terms together, then taking the n-th root of the product, where n is the number of terms. So if you had 5, 7, 55, 6, and 3 as terms, you would take the fifth root of 34,650, giving you 8.0899. Notice how the final result is not very much affected by the outlier (55).

This technique won't work in most databases for non-trivial datasets. Why? Because if you take 100 terms and multiply them, it will likely overflow even the largest column type. In fact, calculating the geometric mean this way with a calculator can quickly become impossible for large data sets.

There is a solution. If you take advantage of the magic of logarithms, you can calculate the geometric mean for very large data sets without overloading a standard relational database data type.

For a given set of terms, sum the log (I will use the natural log) of all terms. Then divide this sum by the number of terms. Finally, take the anti-log of the result (since I use a natural log, this means raising e to the result). Using our example:

ln( 5) =  1.6094
ln( 7) =  1.9459
ln(55) =  4.0073
ln( 6) =  1.7918
ln( 3) =  1.0986
         -------
         10.4531

10.4531/5 = 2.0906

e^2.0906 = 8.0899

which is what we got using the standard technique. This can be translated into SQL this way:

select exp(sum(log(sample)) / count(*))
  from sample_table

-- or, depending on your flavor of SQL...
select exp(sum(ln(sample)) / count(*))
  from sample_table
UPDATE: Isabelle pointed out that there is a neater way to do this. Since the sum(FOO)/count(*) is an average, the expression can be simplified to this:

SQL Server

select exp(avg(log(sample)))
  from sample_table

Oracle, PostGreSQL, MySQL

select exp(avg(ln(sample)))
  from sample_table

Nicely figured out, Isabelle.

As far as I can tell, this will work in Oracle, SQL Server, and Access, and I can't think of why it would not work in MySQL or PosgreSQL.

MS Access: FIX: Left function doesn't work

I have an old Access database. I found that a lot of scalar functions didn't work in it. I tried something simple like this:




select left(l_name, 15)
from test_table


and got this error:




Undefined function 'left' in expression.


I hit alt-F11 to bring up the Visual Basic for Applications (VBA) IDE, and immediately got this error message:




Your Microsoft Office Access dataabse or project contains a missing or broken reference to the file 'MSOWC.DLL' version 1.0


I check my references (tools->references) and saw an entry: "MISSING: Microsoft Office Web Components" which referenced the file "C:/Program Files/Microsoft Office/OFFICE10/MSOWC.DLL"



I went looking for the file-- the first thing I found out was that, althought I have an OFFICE10 directory, I'm running Office 2003 out of OFFICE11. There was no corresponding DLL in the OFFICE11 directory.



I simply unclicked the MISSING reference, saved the file, and it all works fine now. I suspect that the missing file is unneeded in 2003.

Yoga classes in Philadelphia

Since the search for lunchtime kung fu/tai chi classes didn't work out (see No Lunchtime Kung Fu in Center City), I'm looking for yoga classes now. You say flaky, I say eclectic. Anyway, I've turned up two likely studios.

The first one is Bikram-- I went to one Bikram class in New Orleans before we evacuated, and it was cool. Or rather, not, it was done in a 100 degree room. It truly kicked my ass, but in a way that I liked.

Bikram Yoga of Philadelphia
http://www.bikramphiladelphia.com/schedule.shtml
M-F - 06:15 - 7:45, 12:00 - 13:30, 19:30 - 21:00
$20 first week,
10 classes - $120, month unlimited $150

The second one does Ashtanga yoga, which I don't know anything about. However, I like that they do a 12:00 - 12:45 class. That's realistic. One thing I picked up in my Kung Fu search was that studios sometimes take the attitude that if you can't bend your life around the class, you aren't dedicated enough. I guess I'm not very dedicated, but I *would* like to get in some form of exercise.

The Yoga Loft
http://www.theyogaloftphilly.com/schedule.html
M-F 07:00-08:00
M,W,F 12-12:45

By the way, the info I have here is only for classes that worked for my schedule. Go to the websites listed to see the entire schedule.

Thursday, March 23, 2006

39:45 South Philadelphia Loop - Chelsea Williams

It's been a while since my last run, and o lord could I feel it. I was about to go microwave some couscous when a tiny voice in my mind reminded me that I am 41 and will quickly become a tremendous lard-ass if I just eat and never exercise.



I found Chelsea Williams through a friend of a friend. She's a busker in Los Angeles. Man, busking must be tough in Los Angeles if this is the level of quality. I busked quite a lot in Barcelona, and I always thought I was pretty good, but I suck by comparison.



Unfortunately, I wasn't able to download more than two of Chelsea Williams's songs, so I filled out the rest with standbys. Here was my playlist:




  1. Undecided - Chelsea Williams

  2. This Will Be - Chelsea Williams

  3. Te Recuerdo Amanda (I remember you Amanda) - Jose Mercé

  4. La autoradio canta (The car radio sings) - Miguel Bosé

  5. Dime quién soy yo (Tell me who I am) - Niña Pastori

  6. I'm Digging Your Scene - Dr. Robert

  7. Springtime for the World - Dr. Robert

  8. Telephone - Zap Mama



I got into some flamenco today (Jose Mercé & Niña Pastori). And the Miguel Bosé song was a favorite from 1996.



The Dr. Robert songs are from a new acoustic album coming out from the former singer of "The Blow Monkeys".



The run itself was very hard-- I got really tired about halfway through. Not out of breath, oddly enough, just generally worn out. My god I'm an old fart.

Tuesday, March 21, 2006

Microsoft Paint (PBRUSH) Text Toolbar Disappears

I use Microsoft Paint (pbrush) a lot. I recently had the problem that my Text Toolbar just disappeared. I tried selecting and unselecting "Text Toolbar" from the view menu, but nothing worked.



You can fix this from Regedit. The keys are:




My Computer/HKEY_CURRENT_USER/Software/Microsoft/Windows/CurrentVersion/Applets/Paint/Text/PositionX

and PositionY


From Regedit you can set both of these to 0 and the text toolbar will show up on screen again.

Thursday, March 9, 2006

HP-32SII :: Hewlett-Packard RPN Scientific Calculator

What this page is about

I set up this page to see if I could find any other HP32/HP32S/HP32SII
enthusiasts (that is, geeks). I had a great HP-15C when I was
in University, but alas, I could not find another when it came
time to buy a new calculator (I sold the 15C like an idiot).
The HP32SII seemed to fit my bill. It was RPN, programmable,
and had a lot of the same features of the HP15C.


Sample Program
Euclid's Algorithm for the Greatest Common Divisor

; This is an HP-32S version of Euclid's Algorithm to find the greatest common divisor.
; You run this by putting the two numbers for which you want to find the GCD and pressing "XEQ E"
; I'm modifying this to put the non-shifted
; key in brackets next to the shifted key
; to make it less confusing. (Thanks, Stefan)
; Where I say oShift, I mean hit the orange shift key first
; Where I say bShift, hit the blue shift key first
E01 LBL E ;[oShift +]
E02 STO A
F01 LBL F ;[oShift +]
; Divide y register by x register
F02 ÷
; Get the Fractional part
F03 FP ;[bShift square root]
F04 RCL A
; Multiply the fractional part by the first number you put in
F05 x
F06 1
F07 x>y? ;[oShift ÷] then select >
F08 GTO G ;[oShift XEQ]
F09 R(DOWN)
F10 PSE ;[bShift R/S]
; Exchange the x-register with STOrage A
F11 x <-> A ;[bShift x<->y] (next to ENTER)
F12 RCL A
F13 GTO F ;[oShift XEQ]
G01 LBL G
G02 RCL A
G03 RTN


Links

SQL Server: Understanding SYSFILES/SYSALTFILES status

SYSFILES and SYSALTFILES describe the data files used by SQL Server databases. They have the same structure, except that SYSFILES refers to data files in the current database, and SYSALTFILES resides in the MASTER database and refers to all data files in the database.



The STATUS column of SYSALTFILES is made up of a series of flags to tell you properties of a data file. The following query checks for the flags. A zero indicates the flag is not set, a non-zero indicates that the flag is set:




select name,
status,
status & 0x1 [Default device (unused in SQL Server 2000)],
status & 0x2 [Disk file],
status & 0x40 [Log device],
status & 0x80 [File has been written to since last backup],
status & 0x4000 [Device created implicitly by CREATE DATABASE],
status & 0x8000 [Device created during database creation],
status & 0x100000 [Growth is in percentage, not pages]
from master..sysaltfiles


There is more information on this table in Inside SQL Server 2000 by Kalen Delaney from Microsoft Press.

SQL Server: Data file monitor stored procedure

This is a script that lets you monitor datafile sizes in SQL Server. It collects the sizes of all data files in the database and stores them in a history table. You can set this up to run with a SQL Server Agent Job to collect data file sizes nightly.



The table the history gets stored in looks like this:



create table altfiles_history (
date datetime,
dbid smallint,
fileid smallint,
size int
)


The stored procedure code looks like this:




/*
* size_monitor
* Timothy Chen Allen
* http://www.timallen.org/blog
*/
create procedure dbo.size_monitor
as
set nocount on

-- Cursor: names and sizes of data files
declare c_dbsize cursor for
select dbid, fileid, size
from master..sysaltfiles

-- Declare variables into which to fetch
declare @dbid smallint
declare @fileid smallint
declare @size int

open c_dbsize

fetch next from c_dbsize
into @dbid, @fileid, @size

while @@fetch_status = 0
begin
fetch next from c_dbsize
into @dbid, @fileid, @size

insert into trf_dba_utilities..altfiles_history (date, dbid, fileid, size)
values (getdate(), @dbid, @fileid, @size)
end

return


You can query the history table to see how your data files are growing with this query:




/*
* 1) size and maxsize are displayed in 8K blocks. You can
* get this to MBs by dividing by 128.
* 2) maxsize is -1 when the datafile is allowed unlimited
* growth.
*/
select s.date,
d.name database_name,
f.name datafile_name,
s.size/128.0 datafile_size,
case when f.maxsize = -1
then -1
else f.maxsize/128.0
end max_datafile_size
from altfiles_history s
left join master.dbo.sysaltfiles f on s.dbid = f.dbid and s.fileid = f.fileid
left join master.dbo.sysdatabases d on s.dbid = d.dbid
order by d.name, f.name, s.date


Basically, I'm just storing the size column and keys from SYSALTFILES. This history is a nice thing to have so you can watch file growth and see if you need to be automatically shrinking databases, or to watch for unexpected jumps in size.

Wednesday, March 8, 2006

The Katrina Tapes

I thought I had just about written my last Katrina category blog. Guess I was wrong.

I just watched the tapes of the President's briefing four days prior to Katrina hitting New Orleans: http://video.ap.org/v/en-ap/v.htm?f=CADIU . I dunno. Go watch them yourself.

The tape of the White House Press Secretary shows Scott McClellan attempting to spin this somehow to say that the President was blameless. I'm tired of that. Even I was not blameless in this-- there were things I could have done to better ensure my family's safety. I could have done a lot of things better.

That's why it makes me angry to see the most powerful man in the world dance around like a fly girl trying to convince us that he never did a damn thing wrong. Everyone knows he screwed up-- it just makes him look like a liar on top of it for him to claim that he didn't. The kind of excuse making I'm hearing out of the White House in that press conference had a name when I was at the Academy: Sea Lawyering. It's not owning up to one's responsibilities because of some loophole. It's trying to pin a company-level screwup on some PFC in the third rank. It's not being President-- it's just cowardice.