Saturday, 9 April 2016

SAS Arrays

SAS Arrays : Introduction

It provides a simple, appropriate way to process a group of variables in a SAS DATA step.

Syntax
Array array-name {number-of-elements} list-of-variables;
Note: You can use [ ] or { } or ( ) for defining number of elements in the ARRAY statement.

Friday, 8 April 2016

Cartesian Product

1) Cartesian Product

A simplest join in proc sql where the final out is each row of first dataset combined with each row of the second. Its syntax is:

Proc sql;
      create table company as
            select a.empid, b.name, b.salary
            from employee as a , salary as b
            ;
Quit;

Here the final output will be product of all rows from the employee dataset with all rows in the salary dataset.
Practically this looks wrong as everyone’s salary will be everyone others’ salary, perfect Socialism J

But surprisingly this Cartesian product is the basis of all joins. Whenever you specify a join a Cartesian product is done and the output rows are restricted by certain conditions. So knowing this is necessary.

2) Cartesian Product through a datastep

This technique is not very intuitive but is asked in a lot of interviews so I am including it here:

data every_combination;
  /* Set one of your data sets, usually the larger data set */
  set one;
  do i=1 to n;
    /* For every observation in the first data set,    */
    /* read in each observation in the second data set */
    set two point=i nobs=n;
    output;
  end;
 run;

This technique creates a product of all rows.
So above is the hint for answering the question in merging SAS datasets with different id variables names. Let’s see whether you can find the answer.

for more click here or click here

Wednesday, 6 April 2016

Proc SQL joins

A Proc SQL join works same as a datastep merge to combine two or more SAS datasets. There are many types of SQL joins which can be used based on your output requirements.

1) Inner Join

An inner join provides only the matching rows from both datasets. Its syntax is:

Proc sql;
      create table company as
            select a.empid, b.name, b.salary
            from employee as a inner join salary as b
            where a.empid=b.emp_id;
Quit;

Here the final output will be all the employees having their salary information in the salary dataset. If the info for a employee is found in any one of the datasets then that employee will not be present in the final table.

1a) Inner Join with datastep:

The same inner join can be performed with a datastep merge.

data company;
  merge employee(in=a) salary(rename=(emp_id=empid) in=b);
  by empid;
  if a and b;
run;

We required the rename as datastep cannot perform a merge on id variables with different names.

Actually there is a way to merge by differently named id variables but that method is at from the notion of elegant coding and is just developed as a workaround as you can always rename id variables easily.

OK OK If you require it so much I will discuss the method towards the end.

2) Outer Join (Left)

A Left Outer join outputs the matching rows from both datasets as well as it also outputs the non-matching from the left dataset or the dataset specified first in the query. Its syntax is:

Proc sql;
      create table company as
            select a.empid, b.name, b.salary
            from employee as a left outer join salary as b
            where a.empid=b.emp_id;
Quit;

Here the final output will be all the employees whether or not they have their salary information in the salary dataset. If you want to list all the employees irrespective of their salary being updated in the salary dataset then you will use left outer join.

2a) Left outer Join with datastep:

The same join can be performed with a datastep merge. We just need to change the IF condition.

data company;
  merge employee(in=a) salary(rename=(emp_id=empid) in=b);
  by empid;
  if a;
run;

3) Outer Join (Right)
A Right Outer join outputs the matching rows from both datasets as well as it also outputs the non-matching from the right dataset or the dataset specified second in the query. Its syntax is:

Proc sql;
      create table company as
            select a.empid, b.name, b.salary
            from employee as a right outer join salary as b
            where a.empid=b.emp_id;
Quit;

Here the final output will be all the employees who have their salary information in the salary dataset as well as the employees who are not currently updated in employee dataset but their salary info was updated. The situation looks silly but in this case you will be using a right outer join.

One point to note here is that we are taking empid from the employee dataset so for these non-matching employees of the salary dataset the empid will be missing which is definitely not desirable. So as a precaution we generally use the COALESCE function in right outer joins.

Proc sql;
      create table company as
            select coalesce(a.empid,b.emp_id), b.name, b.salary
            from employee as a left outer join salary as b
            where a.empid=b.emp_id;
Quit;

This will solve the problem of missing employee ids.

3a) Right Outer Join with datastep:

The right outer join can be performed with a datastep merge similarly as the left outer.

data company;
  merge employee(in=a) salary(rename=(emp_id=empid) in=b);
  by empid;
  if b;
run;

4) Full Join

A Full join as you have rightly guessed by now outputs the matching rows from both datasets as well as it also outputs the non-matching from both the datasets. Its syntax is:

Proc sql;
      create table company as
            select coalesce(a.empid,b.emp_id), b.name, b.salary
            from employee as a full join salary as b
            where a.empid=b.emp_id;
Quit;

Here the final output will be all the employees who have their salary information either in the salary dataset or the employee dataset.

Here also we use COALESCE function or the same reason as in Right outer join.

4a) Full Join with datastep:

The full join can also be performed with a datastep merge. We just need to eliminate the if condition or for better understanding we can keep the condition as IF A OR B;

data company;
  merge employee(in=a) salary(rename=(emp_id=empid) in=b);
  by empid;
  if a or b;
run;


                   for more    click here 

Tuesday, 5 April 2016

Arrays in SAS


You must have seen arrays being used in various programming languages and must already know quite a bit about them, but before we go into the details let me tell you one thing : 

ARRAYS IN SAS ARE DIFFERENT FROM ANY OTHER LANGUAGE..!!

In most of the languages array is a data structure, holding data values, but in SAS it is not a data structure, it is just a collective name given to a group of variables. Being clear with this distinction is very integral to the part of using arrays successfully in SAS.

In SAS the most important function of the array is to reduce the lines of code where a programs involves repetitive calculations on different variables. I have seen most programmers shy away from their use, but let me tell you that it is a wonderful tool to make your code simple. Also in some instances arrays provide flexibility to make your code dynamic, I’ll get back to this later.

So let us get down to the dirty details :



What are arrays?

An Array is a grouping of variables of the same type used to perform repetitive operations on those variables.

Syntax:

Array array_name[dimension] $ length variable-list

Array – The ARRAY keyword for decleration

Array_name – Any valid SAS name (Do not use function names)

Dimension – Number of elements in the array(If unknown can use * but then we have to provide the varable list)

$ - Tells that the array is character array.

Length – Length of each variable

Variable list – List of variables to be part of the array(Can use named ranges)

Usage:

We have a dataset with 10 variables(kg1 to kg10) containing weights of a patients measure for 10 consecutive weeks. We want to convert these weight to pounds. One way we can do it is :

data kg_to_lbs;
      set weights;
      lbs1=kg1*2.2;
      lbs2=kg2*2.2;
      lbs3=kg3*2.2;
      ...
      lbs10=kg10*2.2;
run;

But this generates 13 lines of code for one simple calculation. Imagine when we have 100 such measurements, So here we can use arrays to make code shorter and easy to understand:

data kg_to_lbs;
      set weights;
      array kg_array {10} kg1-kg10;
      array lbs_array {10} lbs1-lbs10;
      do i = 1 to 10;
            lbs_array{i} = (kg_array{i})*2.2;
      end;
run;

cool na..and there is nothing complicated about this, you just need to declare an array and use a simple do loop.

Below are a few points about arrays which will be all you need to know about them to utilize arrays to their full potential.

Important points to note about arrays:

1 Variables used in an array must be of the same type. Either all numeric or all character.

2  Variables need not be preexisting variables, if they does not exist then SAS creates them for you. This converts into a useful application of creating variables through arrays.

3  SAS needs to know the size(number of elements) of the array while you are creating it. You can supply the size  in brackets next to the array name 

Array test[10] $5; 
Any array of 10 elements

Or you can let SAS count them for you using the number of variables in the variable list.

Array test1[*] $ var1 – var5

You cannot emit both the dimension and variable list together.

4)  Array does not accept numeric variables in the brackets whose value may define the dimension because it creates the array in compile time and the value of the variable used in brackets will be available only in compile time.

Array test[num_var];
ERROR : Array requires a numeric constant

5)  If we want to use all numeric or all character variables of a dataset without bothering about their names then you can declare the array like :

Array nums[*] _NUMERIC_;
Array nums[*] _CHARACTER_;

6)  Sometimes we need an array to hold values temporarily but do not want to output those variables to the final dataset then we can use temporary arrays. They are declare as :

Array arr_name[10] _TEMPORARY_;

Functions used with arrays :

DIM FUNCTION :

This function helps to determine the number of  elements in a array dynamically so while looping you do not have to hardcode it.
For e.g.

Do i=1 to 10;
A[i]= b[i] + c[i]
End;

Can also be written as :

Do i=1 to dim(a);
A[i]= b[i] + c[i]
End;

VNAME FUNCTION :

We get the values of variables in an array using the array name and subscript but if we want the name of the element(variable) by its subscript then we can use the vname function. For e.g.

array arr_name[*] X Y Z P Q R;
i=3;
vars = Vname(arr_name [i]); 

So vars will be initialized to the name of the third variable in the list which is Z.

OF OPERATOR :

This operator is very useful when we have to perform and operation on all elements(or variables) of an array. For e.g
We need a sum of all elements of an array we can write :
X=sum(a[1],a[2],a[3],a[4],a[5],a[6],a[7]…)

But if we don’t know how many elements are there or it changes every time then you need to update it again and again, so instead we can write

X=sum(of a[*]);

Cool and easy.. :-)