sas en knime

Embed Size (px)

Citation preview

  • 8/9/2019 sas en knime

    1/66

  • 8/9/2019 sas en knime

    2/66

    2

  • 8/9/2019 sas en knime

    3/66

    3

    Copyright©2011 by KNIME Press

    All Rights reserved. This publication is protected by copyright, and permission must be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, ortransmission in any form or by any means, electronic, mechanical, photocopying, recording or likewise.

    For information regarding permissions and sales, write to:

    KNIME PressTechnoparkstr. 18005 ZurichSwitzerland

    [email protected]

    ISBN: 978-3-9523926-1-4

    SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

    mailto:[email protected]:[email protected]:[email protected]

  • 8/9/2019 sas en knime

    4/66

    4

    Table of ContentsForeward ............................................................................................................................................................................................................................................................... 7

    Data and Script Structure ...................................................................................................................................................................................................................................... 9

    Script Structure how to build an analysis workflow ................................................................................................................................................................................... 10

    The programming environment ...................................................................................................................................................................................................................... 11

    LIBNAME path to workspace .................................................................................................................................................................................................................... 12PROC PRINT display data table ................................................................................................................................................................................................................... 13

    DELETE delete data table / increase the heap space ................................................................................................................................................................................ 14

    Include other script languages ............................................................................................................................................................................................................................ 15

    PROC SQL Java, R, Python, Perl Snippet ................................................................................................................................................................................................. 16

    Input/Output ....................................................................................................................................................................................................................................................... 17

    INFILE read data from text file .................................................................................................................................................................................................................. 18

    INPUT … datalines create data inside the script ......................................................................................................................................................................................... 19

    connect to /disconnect from connect/disconnect to/from database ....................................................................................................................................................... 20

    PROC IMPORT … DBMS=EXCEL read data from Excel file........................................................................................................................................................................ 21

    Read SAS datasets from KNIME ....................................................................................................................................................................................................................... 22

    Reporting ............................................................................................................................................................................................................................................................. 23

    PROC REPORT create a report ................................................................................................................................................................................................................. 24

    Grouping / Sorting ............................................................................................................................................................................................................................................... 25

    PROC SUMMARY/MEANS calculate statistics for groups of data rows ................................................................................................................................................... 26

    PROC TABULATE shape a table with the aggregation values of one or more columns ............................................................................................................................ 27

    PROC SORT sort data ............................................................................................................................................................................................................................ 28

    BY grouping data rows ........................................................................................................................................................................................................................... 29

    count enumeration variable.................................................................................................................................................................................................................... 30

    first. finding the first row in a row group .................................................................................................................................................................................. 31

    nmiss() count missing values in a row ....................................................................................................................................................................................................... 32

  • 8/9/2019 sas en knime

    5/66

    5

    Join/Concatenate ................................................................................................................................................................................................................................................. 33

    SET concatenate data tables ................................................................................................................................................................................................................... 34

    MERGE merge/join data sets ................................................................................................................................................................................................................. 35

    WHERE row filter ................................................................................................................................................................................................................................... 36

    KEEP/DROP column filter .......................................................................................................................................................................................................................... 37

    String Manipulation ............................................................................................................................................................................................................................................. 38left()/right()/strip()/trim() remove blanks on the left/right/everywhere/on the sides ............................................................................................................................ 39

    upcase()/lowcase()/propcase() convert a string from lowcase to upcase and vice versa ......................................................................................................................... 40

    substr() String Replacer and Cell Splitters .............................................................................................................................................................................................. 41

    scan() find words in a string .................................................................................................................................................................................................................... 42

    countw() count words in a string ............................................................................................................................................................................................................. 43

    index ()/indexw() find index of first occurrence of pattern in string values .............................................................................................................................................. 44

    put() type conversions ............................................................................................................................................................................................................................. 45

    PROC FORMAT format integer ranges or string values ............................................................................................................................................................................ 46

    Logical Operations ............................................................................................................................................................................................................................................... 47

    IF if- then construct .................................................................................................................................................................................................................................. 48

    in() find whether one specific value belongs to a list of values ............................................................................................................................................................. 49

    missing() find missing values .................................................................................................................................................................................................................... 50

    do … end loops .......................................................................................................................................................................................................................................... 51

    Date/Time Manipulation ..................................................................................................................................................................................................................................... 53

    day() / year() / month() date field extractor............................................................................................................................................................................................... 54

    today() calculate and set today’s date ....................................................................................................................................................................................................... 55

    mdy() build date from (month, day, year) .............................................................................................................................................................................................. 56

    Math Operations ................................................................................................................................................................................................................................................. 57

    int()/ceil()/round()/floor() Double to Int conversion ................................................................................................................................................................................. 58

  • 8/9/2019 sas en knime

    6/66

    6

    max()/min() find maximum and minimum value in a row .......................................................................................................................................................................... 59

    max()/min() find maximum and minimum value across workflow variables ............................................................................................................................................. 60

    mean()/sum() average/sum across row values ....................................................................................................................................................................................... 61

    TRANSPOSE transpose data table ............................................................................................................................................................................................................. 62

    Basic Statistics...................................................................................................................................................................................................................................................... 63

    PROC FREQ statistical variables from two columns plus chi-square and Fisher test ............................................................................................................................ 64

  • 8/9/2019 sas en knime

    7/66

    7

    Foreward

    OK, I admit to being a 34 year SAS addict – 25 of them working directly for SAS – so when I was first introduced to KNIME I was skeptical. Open source? Communitysupported? Everything in graphical workflows? FREE full function deskt op Version? VERY different from the SAS I know! To me it sounded more like “bells and whistles”than “solid and robust” and I wondered why the Gartner group gave it “Cool Vendor Status” and what all those pharmaceutical companies saw in it. Now, after using theproduct myself for a number of major projects, I can truly say that KNIME is very powerful indeed!

    Working with Rosaria Silipo meant that I could climb the software learning curve very quickly – she knows both products well, and her publishing this booklet means thatYOU can now get up to speed very quickly.

    I appreciate the fact that Rosaria has focused on comparing CORE SAS functionality to KNIME. Any experienced SAS user knows that a SAS application is, at the end of theday, a program that combines data steps, procs, macro language, and a front end – combined with the knowledge of how to USE them in an application created for aspecific business area. Looking at core functionalities is how we as SAS users most effectively learn something new.

    There are some basic similarities: both SAS and KNIME pull data sensibly into memory and optimize the interplay between I/O and processors and clusters, because bothsuppliers know that at some point, no matter how fast in- memory manipulation is big data will require loading and processing. Both fundamentally understand that it’snot just about statistics and reports, it’s about accessing data – any data – and manipulating that data to get it back out in a usable form. Both believe that a platformthat integrates across desired components is the right way to go.

    Both also try to provide as many techniques as possible in the analytics area. But while SAS employs its large, dedicated team of thousands of developers to generateeverything possible in its proprietary infrastructure (and only once it’s developed and rolled out is it, in fact, possible), KNIME takes the approach that there will ALWAYSbe some unforeseen requirement for an esoteric o r specializ ed technique, so it simply insures that it’s easy to pull that technique directly from another source – whetherthat be R or Weka or you name it.

    Both SAS and KNIME allow beginners to start out small and focused while learning. But once up to speed, you really need the ability to scale up quickly to accommodatemany users and a lot of data without having to rewrite programs; the two platforms also have this positive trait in common. But there are a few major philosophicaldifferences that, when understood, will help you get going with KNIME faster.

    Most important is probably the paradigm difference: fundamentally, SAS is a line-oriented scripting language. Everything on top of base SAS, including all the applicationsand layers successfully developed in the last several years, still just generates that line-oriented code at the end of the day. In SAS, you iterate through data steps andprocs and back and forth until you finally have what you need. KNIME is a node-workflow oriented approach – for absolutely E VERYTHING. You don’t have to learn thedata step language, then the PROC syntax, then a MACRO language for packaging code, and finally one of the application languages (Java, .net or proprietary SAS) inorder to build and surface programs to end users – it's ALL done in nodes and workflows. That means in the beginning you may have to search for the correct nodes andcombinations thereof to achieve what you need to do, but once you are up and running you can quickly mix and match the different types of nodes as needed.

  • 8/9/2019 sas en knime

    8/66

    8

    Another major difference with KNIME: what you program = what you run = what you document = what you can visualize to your audience . Once you’ve built a workflow,it’s available for all required purposes – you don’t have to jump over to a graphics tool and mock up a flow or draw a picture to explain what’s actually going on behindthat exotic-looking macro code!

    Probably the biggest difference though is in the approach to packaging and expanding your application. The KNIME concept of a metanode – taking a series of linkednodes in a workflow and creating ONE reusable node out of them, with all inputs/outputs defined – is extremely powerful.

    Let’s be clear what KNIME is not: it is not attempting to be the ultimate in -memory BI reporting packages; rather, KNIME makes sure that, if needed, you can surfaceeverything through your favorite reporting tool, whatever that is. KNIME is also not trying to be the end-all in MIS batch reporting, although the open source BIRTplatform provided with KNIME will get you started. And while KNIME already has a huge number of business- oriented examples (check out the website!), it doesn’t(yet?) strive to make end-to-end, complete applications. What KNIME does do is make it easy to include and reuse other applications, whether that be from the popularR package or your favorite coding language such as Java, You can also easily CALL other applications – even SAS – to build exactly what you require. And one last tip fromme: start with this booklet and your favorite SAS dataset. Using the special SASB7DAT node, KNIME will read that data in with just one click – which is not only super-simple and fast but also gives you an edge in allowing you to start learning with data that you already know.

    I hope you enjoy using KNIME as much as I do!

    Phil Winters

    Strategic Advisor, Peppers & Rogers GroupCIAgenda

  • 8/9/2019 sas en knime

    9/669

    Data and Script Structure

  • 8/9/2019 sas en knime

    10/66

    10

    Script Structure how to build an analysis workflowSAS KNIME

    Script and data steps

    A SAS analysis is performed by means of a script, which consists mainly of data stepsand proc steps.

    A number of SAS products use a more visual approach. However, the graphic layout istranslated into the corresponding script before execution.

    Workflow and nodes

    KNIME implements visual programming. This means that each data analysisstep is represented by means of an icon block, called node , in a graphicaleditor.

    A sequence of connected nodes is called a workflow and is the correspondingconcept of a script in SAS.

    Data is organized through data tables with column headers and Row IDs. Tolearn how to visualize the content of a data table, s ee “PROC PRINT” later on.

    Example:

    DATA xyzset x y z;

    RUN;

    PROC TABULATE data= xyz;class = education, gender;var income;table gender, education * N;

    RUN; Note . Nodes have three possible status displayed by a little traffic light under thenode itself:

    - Not yet configured or error in execution/configuration -> red light - Configured but not executed -> yellow light - Successfully executed -> green light

    For more details about KNIME, check:- Rosaria Silipo, “ KNIME Beginner’s Luck ”, KNIME Press , 2010- Rosaria Silipo, Michael P. Mazanetz, “ The KNIME Cookbook: Recipes for the

    advanced user ”, KNIME Press, 2011

  • 8/9/2019 sas en knime

    11/66

    11

    The programming environmentThe KNIME workbench

    In the folder where you installed KNIME, start knime.exe .The KNIME workbench opens, which includes: “Workflow Editor”, “Workflow Projects”, “Node Repository” , ”Node Description” , “Outline”, and “Console”. The “Workflow Projects ” panel shows the list of currently available projects for the selected workspace.The “Node Repository ” panel contains all available nodes. A “ Search ” box is available on top of this panel to search for nodes.

    The “Workflow Editor ” in the center allows the creation and the editing of the workflows.If a node is selected either in the “Workflow Editor” or in the “Node Repository” panel, the “ Node Description ” panel shows a description of the node task. The “Outline” panel offers an overview of the workflow and the “Console” panel shows possible execution messages.

    KNIME workflows are created by dragging nodes from the “Node Repository” panel and dropping them into the “Workflow Editor”.Nodes are connected to each other through their input and output ports. Just click the output port of the first node and release on the input port of the second node.Just created nodes have a red light status: not yet configured. To configure a node, right- click the node and select the option “Configure” or alternatively double -clickthe node. The “Configuration” window of this node opens. Configure the node. If the configuration is successful, the node sta tus changes to a yellow traffic light.The node is now configured, but not yet executed. To execute the node, right- click the node and select the “Execute” option. If the execution is successful, the nodechanges its status to a green light.

  • 8/9/2019 sas en knime

    12/66

    12

    LIBNAME path to workspaceSAS KNIME

    libname ;

    Workspace

    Workspace defines the folder where all workflows and intermediate data are saved.

    The path to the workspace is selected at the very beginning after starting knime.exe .

    You can still change the workspace after KNIME has been launched , by selecting “File”in the top menu and then “Switch Workspace”.

    Example:

    libname mylib 'C:\Data\ Course ;

  • 8/9/2019 sas en knime

    13/66

    13

    PROC PRINT display data tableSAS KNIME

    PROC PRINT

    data=;run;

    Example:

    PROC PRINT data=xyz;

    RUN;

    Display data table on th e node’s output port

    The resulting data tables created by nodes are always available. To see them:- Right-click the node in the workflow- Select the last option in the context menu

    Note . Some nodes, like plotting and modeling nodes, have also a more complex “View”function. Th e option leading to this “View” is usually displayed after the “Node name anddescription” option.

  • 8/9/2019 sas en knime

    14/66

    14

    DELETE delete data table / increase the heap space

    SAS KNIME

    proc datasets;delete ;

    run;

    KNIME's memory management ensures disposal of tables that are notused anymore so no equivalent to the DELETE-command in SAS isneeded.

    Example:

    proc datasets;delete xyz;

    run;

  • 8/9/2019 sas en knime

    15/66

    15

    Include other script languages

  • 8/9/2019 sas en knime

    16/66

    16

    PROC SQL Java, R, Python, Perl SnippetSAS KNIME

    proc sql ;create table as

    select from where order by ;

    quit ;

    Java Snippet

    KNIME offers a few Snippet nodes to write and execute pieces of Java, R, Python, andPerl code. Particularly the “Java Snippet” and the “R Snippet” nodes are very powerful,since they can leverage all the built-in libraries of the corresponding script/language.

    Example:

    proc sql;create table Filter as

    select Pricefrom xyzwhere nr > 13order by CustID;

    quit ;

  • 8/9/2019 sas en knime

    17/66

    17

    Input/Output

  • 8/9/2019 sas en knime

    18/66

    18

    INFILE read data from text fileSAS KNIME

    data infile delimiter=;input ... ;run;

    File Reader

    Note 1. Possible reading formats are: Integer, Double, and String. A date must be read as String andlater on converted to DateTime type with a “String To Date/Time” node. Note 2. KNIME has also a specific “CSV Reader” node dedicated to reading CSV files.

    Example:

    data xyz; infile „C: \data\CRM .txt delimiter=';';input CustomerID ContractID date_start date_end

    amount;format date_start date9.;format date_end date9.;

    run;

  • 8/9/2019 sas en knime

    19/66

    19

    INPUT … datalines create data inside the script

    SAS KNIME

    data ;input [position] [format]

    [position] [format]...[position] [format];

    datalines;

    ;run;

    Table Creator

    Note . Possible reading formats are: Integer, Double, and String. A date must be created as Stringand later on converted to DateTime type with a “String To Date/Time” node.

    Example:

    data xyz;input ContractID $ CustID $ nr

    @17 ReceivedDate date9.@27 Price;

    format ReceivedDate date9.;datalines;

    A123 1ABCR 34 28jul1998 45.00B123 2ABCDF 98 20jul1997 345.88C123 3ABCADF 121 19may1999 100.99D123 4A 43 10aug1999 27.37E123 5ABC 80 16aug1999 9.65F123 6ERA 6 18jun1999 19.67G123 7ABC 383 09jan2000 14.89H123 8ABC 26 03aug2000 44.50;run;

  • 8/9/2019 sas en knime

    20/66

    20

    connect to /disconnect from connect/disconnect to/from database

    SAS KNIME

    proc sql;

    connect to as ();create table asselect * from connection to

    ();disconnect from ;

    quit;

    Database ConnectorDatabase Reader

    Note 1 . The “Database Connector” node opens the connection to the database and keeps itopen to build an SQL query. A “Database Disconnector” node is not necessary. The “DatabaseReader” node connects to the datab ase, reads the table by means of a SELECT query, anddisconnects.Note 2 . In the SQL query, built after the “Database Connector” node, INSERT and DELETEstatements are not possible.

    Example:

    proc sql; connect to oracle as dblink

    (user= xx x x , password= xxxx , path= ORCL );create table xyz asselect * from connection to oracle

    (select * from employees);disconnect from dblink;

    quit;

  • 8/9/2019 sas en knime

    21/66

    21

    PROC IMPORT … DBMS=EXCEL read data from Excel fileSAS KNIME

    PROC IMPORT OUT= DATAFILE= DBMS=EXCEL REPLACE;

    SHEET=;GETNAMES=YES/NO;MIXED=YES/NO;USEDATE=YES/NO;SCANTIME=YES/NO;

    RUN;

    XLS Reader

    Note . Possible reading formats are: Integer, Double, and String. A date must be read as Stringand later on converted to DateTime type with a “String To Date/Time” node.

    Example:

    PROC IMPORT OUT= work.data1 1DATAFILE= "C:\Months.xls"DBMS=EXCEL REPLACE;

    SHEET="Tabelle1";GETNAMES=YES;MIXED=YES;USEDATE=YES;SCANTIME=YES;

    RUN;

  • 8/9/2019 sas en knime

    22/66

    22

    Read SAS datasets from KNIME

    Read SAS7BDAT (DSREAD)

    KNIME Labs has a dedicated node to read SAS datasets , named “SAS7BDAT(DSREAD)”.To install this extension from the KNIME Labs, in the top menu:

    - Click “Help” - Select “Install New Software …” - Select the update site, like: http://www.knime.org/update/2.4 - Open the “KNIME Labs Extensions” category - Select “KNIME SAS7BDATA Reader (Windows Only)” - Click “Next” and follow the installation instructions

    http://www.knime.org/update/2.4http://www.knime.org/update/2.4http://www.knime.org/update/2.4http://www.knime.org/update/2.4

  • 8/9/2019 sas en knime

    23/66

    23

    Reporting

  • 8/9/2019 sas en knime

    24/66

    24

    PROC REPORT create a reportSAS KNIME

    proc report data=;column ;define / ;;

    run;

    KNIME Reporting Tool

    - In the workflow connect a “Data To Report” node to the data to be displayed inthe report.

    - Right-click the workflow in the “Workflow Projects” panel on the top left. TheKNIME Reporting Tool opens.- Create your report in the KNIME Reporting Tool.

    Example:

    proc report data=xyz nowindows;

    column x y z;

    define x / group format=$xfmt.;define y / analysis sum

    format=dollar9.2;define z / group format=$zfmt.;

    …. run ;

  • 8/9/2019 sas en knime

    25/66

    25

    Grouping / Sorting

  • 8/9/2019 sas en knime

    26/66

    26

    PROC SUMMARY/MEANS calculate statistics for groups of data rows

    SAS KNIME

    proc summary data= ;class = , , … ; var ;OUTPUT ;

    run;

    GroupBy

    The “GroupBy” node is configured via two tabs:- “Groups” defines the group columns- “Options” defines the aggregation variables and the aggregation methods

    Columns selected in “ Groups ” correspond to columns selected by “ class ”. Columns selected in “ Options ” correspond to columns selected by “ var ”. Aggregation methods selected in “ Options ” correspond to :“”.

    Example:

    PROC SUMMARY DATA= xyz; CLASS ContractID; VAR Price; OUTPUT OUT=example1 SUM(Price) = tot_sales MEAN(Price) = mean_sales NMISS(Price) = bad_sales ; RUN;

  • 8/9/2019 sas en knime

    27/66

    27

    PROC TABULATE shape a table with the aggregation values of one or more columnsSAS KNIME

    proc tabulate data= out=;class = , , … ; var ;table *, …, * ;

    run;

    Pivoting

    The “Pivoting” node is configured via three tabs:

    - “Groups” defines the group columns (final row IDs)- “Pivots” defines the pivoting columns (final column headers) - “Options” defines the aggregation variables and the aggregation methods

    Columns selected in “ Groups ” and “ Pivots ” correspond to columns selected by “ class ”. Columns selected in “ Options ” correspond to columns selected by “ var ”. Aggregation methods selected in “ Options ” correspond to “ ”.

    The “Pivoting” node produces three output tables: the pivot table and the total valuesfor column and rows.

    Example:

    proc tabulate data= xyz out=xyz;class = education, gender;var number;table gender, education * N;

    run;

  • 8/9/2019 sas en knime

    28/66

    28

    PROC SORT sort dataSAS KNIME

    PROC SORT DATA= OUT=;BY ;

    RUN ;

    Sorter

    Note . The “Sorter” node has no option to filter out duplicates. To filter out duplicates you needto use a “RowID” or a “Joiner” node.

    Example:

    PROC SORT DATA=xyz OUT=xyz2 ;BY CustID, ContractID, DateReceived ;

    RUN ;

  • 8/9/2019 sas en knime

    29/66

    29

    BY grouping data rows

    SAS KNIME

    Proc Sort data=; by

    … ; data ;

    set ; by … ;

    run;

    Sorter / GroupBy

    The philosophy is a bit different. Normally, a sort is not required before a n equivalent“BY” operation is performed. For data manipulation, use the “GroupBy node” andadditional data manipulation nodes as required.

    The BY statement can be implemented with a “GroupBy” node .“GroupBy” can perform a number of aggregations by the group columns.

    However, we have no information on the first data row and the last data row of eachgroup.If we want to perform some conditional/assignment/operation on the first/last datarow of each group, we need to implement a counter with a “ Java Snippet” node.

    See “counter” below.Example:Proc Sort data=data1;by class gender;data data1;

    set xyz; by class gender;

    run;

  • 8/9/2019 sas en knime

    30/66

    30

    count enumeration variableSAS KNIME

    data ;

    set ;count + 1;run;

    Global Variable in Java Snippet node

    The enumeration variable “count” can be easily obtained using a global variable in a“Java Snippet” node.

    Example:

    data xyz;set xyz;count + 1;

    run;

  • 8/9/2019 sas en knime

    31/66

    31

    first. finding the first row in a row groupSAS KNIME

    data ;set ;count + 1;by … ; if then count = 1;

    run;

    Sorter + counter in Java Snippet

    The first row in a group derived from a “BY” statement can be found by combining a “Sorter”node and a “Java Snippet” node. - The “Sorter” node sorts the data using the “BY” variable as the key; - The “Java Snippet” node stores the “BY” variable value in a global variable and

    compares the new and the old value at every row.- When the “BY” variable has changed value, this is the first row of a new data group.

    Example:

    data xyz;set xyz;count + 1;by sex;if first.sex then count = 0;

    run;

  • 8/9/2019 sas en knime

    32/66

    32

    nmiss() count missing values in a row

    SAS KNIME

    data ;

    set ; = nmiss( , , …, ) ;run;

    Missing Values node + Java Snippet node

    Here is the code for the “Java Snippet” node:

    int count = 0;if ($sex$.equals("unknown")) count++;if($CustID$.equals("unknown")) count++;if($ContractID$.equals("unknown")) count++;return count;

    Note 1 . The “Java Snippet” node alone is not enough to count the number of missing values in arow, because it does not have a missing() function. In this solution, missing values are convertedinto a special value first with the “Missing Value” node and then counted via the “Java Snippet”node.

    Note 2. Since the counting is calculated across columns and not across rows, global variables arenot needed.

    Example:

    data xyz;set data1;nr_missing = nmiss( sex, CustID, ContractID ) ;

    run;

  • 8/9/2019 sas en knime

    33/66

    33

    Join/Concatenate

  • 8/9/2019 sas en knime

    34/66

    34

    SET concatenate data tablesSAS KNIME

    data ;set ;

    run;

    Example:

    data xyzset x y z;

    run;

    Concatenate (Optional in)

    Note . The “Concatenate (Optional in) ” node has a maximum of 4 input. If we need toconcatenate more than four data sets, we need to concatenate more “Concatenate (Optionalin)” node s.

  • 8/9/2019 sas en knime

    35/66

    35

    MERGE merge/join data setsSAS KNIME

    data ; merge ( in= )

    ( in= );by ;

    run;

    Joiner

    Note 1 . “Duplicate Column Handling” is defined in the “Column Selection” tab. Here you can choosebetween: skip duplicate columns, stop the node execution if a column with the same name i s found inboth tables, or append a suffix to the name of duplicate columns.

    Note 2. There is no merging option in handling duplicate columns. That is: unlike “merge” in SAS, the“Joiner” node in KNIME does not fill the missing values in one column in one table with the non -missing values from the column with the same name in the other table. To merge values from twocolumns in the joined data table, you need to use the “Column Merger” or the “Rule Engine” node.

    Example:

    data xyz; merge sales( in=a )

    contracts( in=b );by id;

    run;

  • 8/9/2019 sas en knime

    36/66

    36

    WHERE row filter

    SAS KNIME

    proc … data=…

    where ;run;

    data set ;

    where ;run;

    Row Filter

    Note . You need to build the condition in the configuration window of the “Row Filter” node.Notice that it is also possible to EXCLUDE the rows matching the pattern.

    Example:

    proc means data=xyz; where x3;

    run;

    data xyz;set xyz;

    where ContractID=123;run;

  • 8/9/2019 sas en knime

    37/66

    37

    KEEP/DROP column filter

    SAS KNIME

    proc … data=… keep … ;

    [ drop … ;] run;

    data set keep … ;

    [ drop … ;] run;

    Column Filter

    Example:

    proc means data=xyz;keep x, y;

    [ drop z;]run;

    data xyz;set xyz;keep ContractID, CustID, Price;

    [ drop nr,DateReceived;]run;

  • 8/9/2019 sas en knime

    38/66

    38

    String Manipulation

    Note. Starting from KNIME 2.5 a new general String Manipulation node will make available in the same node a number of additional stringmanipulation operations.

  • 8/9/2019 sas en knime

    39/66

    39

    left()/right()/strip()/trim() remove blanks on the left/right/everywhere/on the sides

    SAS KNIME

    data ;set ; = left( ) ; = right( ) ; = strip( ) ; = trim( ) ;

    run;

    String Replacerwith a blank as wildcard pattern

    Note . The “String Replacer” node acts as strip().

    trim(). There is not a specific node for trim() , but you can use a “Java Snippet” node (“return$CustID$.trim( );”).

    left()/right(). There is not a specific node for left() and right() . The same effect can be obtainedthrough a combination of the “String Replacer” node and one of the “Cell Splitter” nodes .

    Example:

    data xyz;set data1;CustID = left( CustID ) ;CustID = right( CustID ) ;CustID = strip( CustID ) ;CustID = trim( CustID ) ;

    run;

  • 8/9/2019 sas en knime

    40/66

    40

    upcase()/lowcase()/propcase() convert a string from lowcase to upcase and vice versa

    SAS KNIME

    data ; = upcase( ) ; = lowcase( ) ; = propcase( ) ;

    run;

    Case Converter

    Note . The “Case Converter” node does not have an option for propcase(), to capitalize only thefirst letter of each word.

    Example:

    data xyz;ContractID = upcase( ContractID ) ;ContractID = lowcase( ContractID ) ;ContractID = propcase( ContractID ) ;

    run;

  • 8/9/2019 sas en knime

    41/66

    41

    substr() String Replacer and Cell SplittersSAS KNIME

    1. substr ( , ,< length >) = characters-to-replace

    OR

    2. = substr( < string>, ,< length >)

    1. String Replacer

    Note . The “String Replacer” node allows the use of wildcards “*”

    2. Cell Splitter By Position

    Note . KNIME has 3 “Cell Splitter” nodes: “Cell Splitter by Position” based on indices, “CellSplitter” based on delimiter matching, and “Regex Split” based on Regular Expression matching.

    Example:

    1. a='KNIME';substr( a,4,1 ) ='F';

    RESULT: a = “KNIFE“

    OR

    2. newStr='Information Mining';newStr1= substr( newStr,1,4 ) ;

    newStr2= substr( newStr,12,6 ) ;

    RESULT: newStr1 = “Info” newStr2 = “Mining”

  • 8/9/2019 sas en knime

    42/66

    42

    scan() find words in a stringSAS KNIME

    data ;set ; = scan( < string>, , ,

    ) ;end;

    run;

    Cell Splitter

    Example:

    data xyz;

    set data1;delim = „, ; modif = „mo ; nwords = countw ( string, delim, modif ) ;

    do count = 1 to nwords;word = scan( string, count, delim, modif ) ;output;

    end;run;

  • 8/9/2019 sas en knime

    43/66

    43

    countw() count words in a stringSAS KNIME

    data ;set ;

    = countw( < string>, , ) ; ) ;end;

    run;

    Java Snippet

    There is no node in KNIME that is specialized to count words in a string with a givendelimiter. However, you can use a “Java Snippet” node with the following code , where$string-column$ contains the strings where to count the words.

    Example:

    data xyz;set data1;delim = „, ; modif = „mo ; nwords = countw( string, delim, modif ) ;

    do count = 1 to nwords;word = scan(string, count, delim, modif) ;output;

    end;run;

  • 8/9/2019 sas en knime

    44/66

    44

    index ()/indexw() find index of first occurrence of pattern in string values

    SAS KNIME

    data ;set ; = index( ) ; = indexw( ) ;

    run;

    Row Splitter node + 2 Rule Engine nodes + Concatenate node

    Note . The “Rule Engine” node cannot alone simulate index()/indexw(), because itallows neither pattern matching nor wildcards in the string comparison.

    Example:

    data xyz;set data1;found = index( string, “Mining” ) ;found = indexw( string, “Mining”, “%” ) ;

    run;

  • 8/9/2019 sas en knime

    45/66

    45

    put() type conversionsSAS KNIME

    … put()

    source = ;ff = ; = put( source, format ) ; …

    Example:

    … source = “12.01.2011” ;ff= date8.;source_date = put(source, date8.);…

    Rename

    The „Rename“ node alone offers a subset of type conversion utilities:

    For more specific type conversions you can use one of the following nodes:

    Number To StringString To Number

    Double To Int

    String To Date/TimeTime To String

  • 8/9/2019 sas en knime

    46/66

    46

    PROC FORMAT format integer ranges or string valuesSAS KNIME

    proc format library=;

    value [range1]= text1'[range2]='text2'[range3]='text3' ;

    value $ ='text4'='text5' ;

    … run;

    In Report:In Tab „Property Editor” (panel under the Report Layout editor) :

    - select Tab “Map” to display values or ranges as text strings

    In workflow:

    Cell Replacer

    String Replace (Dictionary)

    Rule Engine

    Note . A group of nodes, however, can work like a macro - that is it can be recalled at any pointof the workflow - only in the professional version: the KNIME Server. In the Desktop version (thefree version) you need to recreate the same sequence of nodes to re-execute the sameoperation.

    Example:

    proc format library=my;

    value education 1-11 = 'till High School'12 = 'High School'

    12

  • 8/9/2019 sas en knime

    47/66

    47

    Logical Operations

  • 8/9/2019 sas en knime

    48/66

    48

    IF if- then construct

    SAS KNIME

    data ; set ;if then ; if then ; else ;

    .....run;

    Rule Engine

    Note . The consequent in the “Rule Engine” node can either be a String or the value in anothercolumn.

    Example:

    data xyz;

    set xyz;

    if Price < 50 then label= discounted ;else if Price > 200 then label = „overpriced ; else label= normal price d ;

    .....run;

    http://javeeh.net/sasintro/intro59.htmlhttp://javeeh.net/sasintro/intro59.htmlhttp://javeeh.net/sasintro/intro59.htmlhttp://javeeh.net/sasintro/intro59.html

  • 8/9/2019 sas en knime

    49/66

    49

    in() find whether one specific value belongs to a list of valuesSAS KNIME

    data ;set ;if/where in( )

    [then ];

    run;

    Rule Engine (IN( …))

    Example:

    data xyz;set xyz;if class in

    ( „A , B , C ) then prediction = „Y ;

    run;

  • 8/9/2019 sas en knime

    50/66

    50

    missing() find missing values

    SAS KNIME

    data ;set ;if missing( ) then ;

    run;

    Rule Engine (MISSING)

    Example:

    data xyz;set data1;if missing( ContractID ) then label=CustID;else label=ContractID;

    run;

  • 8/9/2019 sas en knime

    51/66

    51

    do … end loopsSAS KNIME

    data do

    … end ;

    run;

    …Loop Start / … Loop End

    In the „Flow Control“ -> „Loop Support” category, KNIME offers a number of different loop nodesto start and to end a loop. All loops in KNIME start with a “ Loop Start” node and end witha “ Loop End” node. All nodes between the loop start node and the loop end node makethe loop body.

    - Cycle “f or ” with a fixed number of iterations . The “ Counting Loop Start ” node , in combinationwith a “ Loop End ” node, implements a “for” cycle.

    - Cycle “w hile ”. The “Generic Loop Start ”, in combination with a “ Variable Condition Loop(End)”, implements a “while” cycle.

    - Looping on a list of values . The “TableRow To Variable Loop Start ” node, together with a

    “Loop End ” node , loops across all rows of a table and transforms each row into a flow variable.This is particularly useful, for example, to loop on a list of unique values.

    - Looping on chunks of data rows. The “Chunk Loop Start ” loops across chunks of the inputrows. The chunking can be set as either a fixed number of chunks (which is equal to thenumber of iterations) or a fixed number of rows per chunk/iteration.

    - Cycle “f or ”, between max and min with step size. Finally, the “ Interval Loop Start ” nodeimplements a “for” cycle between a start and an end point and using a predefined step size.

    Example:

    data iris;do species=1 to 3;

    … end ;

    run;

    http://javeeh.net/sasintro/intro86.htmlhttp://javeeh.net/sasintro/intro86.html

  • 8/9/2019 sas en knime

    52/66

    52

    macro group nodes into a metanode

    SAS KNIME

    %macro (par1=value1, par2=value2);… Code

    … %mend ;

    Metanode

    In KNIME nodes can be grouped together inside a meta-node like code lines inside a macro in SAS.Just select the meta-nodes, right- click and select “Coll apse into Meta Node”.

    Alternatively you can create a meta- node from scratch by clicking the “Add Meta Node Wizard”button in the top bar and following the meta-node wizard instructions.

    Input parameters can be introduced in a meta-node by means of the “Quick Form” nodes.

    Note . The same meta-node can be stored centrally and called by different workflows, workspaces,and even users. This meta-node sharing feature is only available in the KNIME Server.

    Example:

    %macro finance(yvar=expenses,xvar=division);proc plot data=yearend;plot &yvar*&xvar;

    run;%mend finance;

  • 8/9/2019 sas en knime

    53/66

    53

    Date/Time Manipulation

  • 8/9/2019 sas en knime

    54/66

    54

    day() / year() / month() date field extractorSAS KNIME

    DATA ;INPUT

    ;M = MONTH( ) ;

    D = DAY( ) ;Y = YEAR( ) ;wk_d= WEEKDAY( ) ;q = QTR( ) ;

    … ;RUN;

    Date Field Extractor

    Note . The “Date Field Extractor” node works on DateTime data type. Since a date is always readas a String, it has to be converted to the DateTime type before the “Date Field Extractor” node isapplied. For the conversion, use a “String to Date/Time” node.

    Example:

    DATA dates;INPUT name $ 1-4 @6 bday mmddyy11.;

    M = MONTH( bday ) ;D = DAY( bday ) ;Y = YEAR( bday ) ;wk_d= WEEKDAY( bday ) ;q = QTR( bday ) ;

    … ;

    RUN;

  • 8/9/2019 sas en knime

    55/66

    55

    today() calculate and set today’s date SAS KNIME

    data ; = today() ;

    run;

    Time Difference

    Sometimes today() is used to calculate the time between some date and today. To calculatedate and time differences use the “Time Difference” node with option “Use execution time”enabled.

    Java Snippet

    If you just need a timestamp for your data, you can use a “Java Snippet” node with the following

    code: return new Date().toString();

    Example:

    data _null_;current_date = today() ; if (current_date - datedue) > 30 then

    do;put 'As of ' current_date date9.

    ' the payment of invoice #'Invoice_nr 'is expired.';

    end;run;

  • 8/9/2019 sas en knime

    56/66

    56

    mdy() build date from (month, day, year)

    SAS KNIME

    data ;set ; = mdy( ,, ) ;

    run;

    Column Combiner node + String To Date/Time node

    Starting from 3 columns „day“, „month”, and “year”, combine them with the “ColumnCombiner” node using the ‘.’ as separator character, for example.Using a “String To Date/Time” node convert the combined string to a DateTime type.

    Note . The DateTime format in the menu of the “String To Date/Time” node is editable. If you donot find the format that you need in the menu, you can always edit one of the available formats.

    Note . For a conversion to DateTime you need day and month in a two- char format, like “03” forMarch. If you start from day and month as Integer, the “Number To String” node does notsupply a fixed format with zero-padding for the conve rsion. So, Integer “3” becomes String “3”while you need “03” for a conversion to DateTime. To format day and month as two -char stringsyou need to use a “String Replacer” node .

    Example:

    data xyz;set data1;birthday = mdy( 11, 13, 1965 ) ;

    run;

  • 8/9/2019 sas en knime

    57/66

    57

    Math Operations

    i () il() d() fl () bl i

  • 8/9/2019 sas en knime

    58/66

    58

    int()/ceil()/round()/floor() Double to Int conversion

    SAS KNIME

    data ;set = int( < matrix> ) ; = round( < matrix> ) ; = floor( < matrix> ) ; = ceil( < matrix> ) ;

    run;

    Double To Int

    Note . int() in SAS is a mixture between ceil() and floor(). To calculate int() in KNIME you canuse:

    - A “Java Snippet” node with a cast (int) - A “Cell Splitter” node using ‘.’ or ‘,’ as the delimiter character , followed by a “String To

    Number” node .

    Example:

    data xyz;set data1;

    Price = int( Price ) ;Price = round( Price ) ;Price = floor( Price ) ;Price = ceil( Price ) ;

    run;

    ()/ i () fi d i d i i l i

  • 8/9/2019 sas en knime

    59/66

    59

    max()/min() find maximum and minimum value in a rowSAS KNIME

    data ;set ; =

    max/min( < matrix_1> , …, ) ;run;

    Math Formula

    Note . The “Math Formula” node is not part of the basic KNIME Desktop. It is part of the extension package“KNIME Math Expression Extension (JEP)” that must be installed vi a:

    - “Help” -> “Install new Software ”

    - Select an update site- Select “KNIME & Extensions”- Select “KNIME Math Expression Extension (JEP)” package - Click “Next”

    Java Snippet return Math.max/Math.min ($col_1$, $col_2$ , …, $col_n$);

    Example:

    data xyz;set data1;y = max( nr, Price ) ;

    run;

  • 8/9/2019 sas en knime

    60/66

    ()/ () g / l

  • 8/9/2019 sas en knime

    61/66

    61

    mean()/sum() average/sum across row valuesSAS KNIME

    data ;set ;

    = mean/sum( col_1, col_2, …, col_n ) ; = mean/sum( of col1-coln ) ; = mean/sum( of col_1, col_2, … col_n ) ;

    run;

    Math Formula

    Note 1 . There is no specific sum() function in the “Mathematical Function” list ed in the “Math

    Formula” node. However, you can easily build one by multiplying the “average()” function bythe number of its arguments.

    Note 2 . The “Math Formula” node is not part of the basic KNIME Desktop. It is part of theextension package “KNIME Math Expression Extension (JEP)” that must be installed via:

    - “Help” -> “Install new Software” - Select an update site- Select “KNIME & Extensions” -> “KNIME Math Expression Extension (JEP)” package - Click “Next”

    Example:

    data xyz;set xyz;avg = mean( x, y, z ) ;avg = mean( of x1-x10 ) ;avg = mean( of x, y, z ) ;

    run;

  • 8/9/2019 sas en knime

    62/66

    62

    TRANSPOSE transpose data table

    SAS KNIME

    proc transpose data= out=;[by ;]

    run;

    Transpose

    Note . The “Transpose” node performs a simple transposition (that is, without the “BY”statement) of the input data table. It has no configuration settings, besides a few memoryoptions.

    Example:

    proc transpose data=indata out=outdata;

    by location date;run;

  • 8/9/2019 sas en knime

    63/66

    63

    Basic Statistics

    PROC FREQ statistical variables from two columns plus chi square and Fisher test

  • 8/9/2019 sas en knime

    64/66

    64

    PROC FREQ statistical variables from two columns plus chi-square and Fisher testSAS KNIME

    proc freq data = ;

    tables * / ;run;

    Crosstab

    The default freq calculations and the statistics of the proc freq can be found in the“Crosstab” node.

    The“Crosstab” node analyzes the relation of two columns with categorical data bydisplaying the frequency distribution (N, %, deviation, cell chisquare, etc… ) of thecategorical variables in one of the output tables.The “Crosstab” node also provides chi-square test statistics and, in case of a crosstabulation of dimension 2x2, the Fisher's exact test. The results of these tests aredisplayed in the second output table.

    Statistics

    The node “statistics” also calculates a number of statistics variables for theinput data table columns.

    Example:

    proc freq data = xyz;tables class*value / chisq;TITLE 'Simple Example of PROC FREQ';

    run;

    Default freq calculations:N, %, cumulative N, cumulative %

    The “ chisq ” option to the tables statement performs the

    standard Pearson chi-square test on the table(s)requested.The “ expected ” option prints the expected number ofobservations in each cell under the null hypothesis.The “ exact ” option requests Fisher's exact test for thetable(s). This is automatically computed for 2 x 2 tables.

  • 8/9/2019 sas en knime

    65/66

    65

    The KNIME Booklet for SAS Users

    From a personal experience, I know how difficult it can be to switch from one software

    tool to another. Even though both tools provide the same functionalities, a change ofmind set is needed to discover where and how such functionalities are implemented inthe new software tool.

    This book is a quick guide on how to use KNIME for users coming from the SASexperience. It is not an introduction to KNIME, since it is assumed that the user isalready familiar with the basic concepts of data manipulation, analysis, and reporting inKNIME.It is more a map of the most commonly used SAS functions and techniques into theirKNIME equivalents.

    For a complete introduction to KNIME, please refer to my book “KNIME Beginner’s Luck”available from KNIME Press under www.knime.com/press

    About the Author

    Dr Rosaria Silipo has been mining data since her master degree in 1992. She kept miningdata throughout all her doctoral program, her postdoctoral program, and most of her

    following job positions. She has many years of experience in data analysis, reporting,business intelligence, training, and writing. In the last few years she has been using KNIME

    for her consulting work, becoming a KNIME certified trainer and an expert in the KNIMEReporting tool.

    ISBN: 978-3-9523926-1-4

  • 8/9/2019 sas en knime

    66/66