Wednesday, 6 April 2016

Proc SQL joins

A Proc SQL join works same as a datastep merge to combine two or more SAS datasets. There are many types of SQL joins which can be used based on your output requirements.

1) Inner Join

An inner join provides only the matching rows from both datasets. Its syntax is:

Proc sql;
      create table company as
            select a.empid, b.name, b.salary
            from employee as a inner join salary as b
            where a.empid=b.emp_id;
Quit;

Here the final output will be all the employees having their salary information in the salary dataset. If the info for a employee is found in any one of the datasets then that employee will not be present in the final table.

1a) Inner Join with datastep:

The same inner join can be performed with a datastep merge.

data company;
  merge employee(in=a) salary(rename=(emp_id=empid) in=b);
  by empid;
  if a and b;
run;

We required the rename as datastep cannot perform a merge on id variables with different names.

Actually there is a way to merge by differently named id variables but that method is at from the notion of elegant coding and is just developed as a workaround as you can always rename id variables easily.

OK OK If you require it so much I will discuss the method towards the end.

2) Outer Join (Left)

A Left Outer join outputs the matching rows from both datasets as well as it also outputs the non-matching from the left dataset or the dataset specified first in the query. Its syntax is:

Proc sql;
      create table company as
            select a.empid, b.name, b.salary
            from employee as a left outer join salary as b
            where a.empid=b.emp_id;
Quit;

Here the final output will be all the employees whether or not they have their salary information in the salary dataset. If you want to list all the employees irrespective of their salary being updated in the salary dataset then you will use left outer join.

2a) Left outer Join with datastep:

The same join can be performed with a datastep merge. We just need to change the IF condition.

data company;
  merge employee(in=a) salary(rename=(emp_id=empid) in=b);
  by empid;
  if a;
run;

3) Outer Join (Right)
A Right Outer join outputs the matching rows from both datasets as well as it also outputs the non-matching from the right dataset or the dataset specified second in the query. Its syntax is:

Proc sql;
      create table company as
            select a.empid, b.name, b.salary
            from employee as a right outer join salary as b
            where a.empid=b.emp_id;
Quit;

Here the final output will be all the employees who have their salary information in the salary dataset as well as the employees who are not currently updated in employee dataset but their salary info was updated. The situation looks silly but in this case you will be using a right outer join.

One point to note here is that we are taking empid from the employee dataset so for these non-matching employees of the salary dataset the empid will be missing which is definitely not desirable. So as a precaution we generally use the COALESCE function in right outer joins.

Proc sql;
      create table company as
            select coalesce(a.empid,b.emp_id), b.name, b.salary
            from employee as a left outer join salary as b
            where a.empid=b.emp_id;
Quit;

This will solve the problem of missing employee ids.

3a) Right Outer Join with datastep:

The right outer join can be performed with a datastep merge similarly as the left outer.

data company;
  merge employee(in=a) salary(rename=(emp_id=empid) in=b);
  by empid;
  if b;
run;

4) Full Join

A Full join as you have rightly guessed by now outputs the matching rows from both datasets as well as it also outputs the non-matching from both the datasets. Its syntax is:

Proc sql;
      create table company as
            select coalesce(a.empid,b.emp_id), b.name, b.salary
            from employee as a full join salary as b
            where a.empid=b.emp_id;
Quit;

Here the final output will be all the employees who have their salary information either in the salary dataset or the employee dataset.

Here also we use COALESCE function or the same reason as in Right outer join.

4a) Full Join with datastep:

The full join can also be performed with a datastep merge. We just need to eliminate the if condition or for better understanding we can keep the condition as IF A OR B;

data company;
  merge employee(in=a) salary(rename=(emp_id=empid) in=b);
  by empid;
  if a or b;
run;


                   for more    click here 

No comments:

Post a Comment