Data Management and Preparation

Data Management and Preparation is the foundation of any analytical workflow. In the Certificate in Data Analytics with SAS, learners encounter a wide range of terminology that describes how raw information is turned into a reliable, analys…

Data Management and Preparation

Data Management and Preparation is the foundation of any analytical workflow. In the Certificate in Data Analytics with SAS, learners encounter a wide range of terminology that describes how raw information is turned into a reliable, analysis‑ready asset. The following explanation covers the most important terms, provides practical examples using SAS, and discusses common challenges that arise in real‑world projects. The goal is to give students a vocabulary they can apply immediately when they begin to import, clean, transform, and load data.

Data Source refers to the origin of information. Sources can be relational databases, flat files such as CSV or Excel, web services, sensor streams, or even unstructured text logs. In SAS, a source is accessed through a library definition or an access engine. For example, a CSV file located on a network share can be read with a DATA step that uses the INFILE statement, while a SQL Server table can be accessed using PROC SQL together with the SAS/ACCESS Interface to ODBC.

SAS Library is a logical reference that points to a physical location where SAS data sets are stored. A library is created with the LIBNAME statement, and it can refer to a directory on the file system, a database, or a remote server. Example:

``` Libname mydata '/folders/myfolders/project_data'; ```

The library name (mydata) becomes a prefix that allows the programmer to reference any data set within that location, such as mydata.Sales_2023.

Data Set in SAS is a collection of observations (rows) and variables (columns) that share a common structure. Each observation represents a single record, while each variable holds a specific attribute of that record. A data set is analogous to a table in a relational database. The term observation is interchangeable with record or case, and the term variable aligns with field or column.

Variable can be numeric, character, or date/time. SAS stores numeric values in floating‑point format, while character values are stored as strings of a defined length. The length and type are defined by the INPUT statement or by a PROC IMPORT step that reads external data. For instance:

``` Data mydata.Customers; Infile 'customers.Csv' dlm=',' dsd firstobs=2; Input CustomerID $ Name $ Age Salary; Run; ```

In this example, CustomerID and Name are character variables, while Age and Salary are numeric.

Metadata describes the structure, meaning, and provenance of data. In SAS, metadata can be obtained through PROC CONTENTS, which lists variable names, types, lengths, formats, and labels. A data dictionary is a more detailed document that explains each variable’s business meaning, permissible values, and data quality expectations. Maintaining accurate metadata is essential for data governance and for ensuring that downstream analysts correctly interpret the data.

Import is the process of bringing data from an external source into a SAS data set. Import can be performed with a DATA step using INFILE, with PROC IMPORT for delimited or Excel files, or with PROC SQL when connecting to a database. The choice of method depends on the source format, the need for control over data types, and performance considerations. For example, PROC IMPORT automatically determines variable types based on the first few rows, which can lead to misclassification if the sample is not representative.

Export is the reverse operation: Writing a SAS data set to an external format such as CSV, Excel, or a database table. PROC EXPORT or the LIBNAME engine for Excel can be used. When exporting, attention must be paid to preserving formats, handling special characters, and ensuring that missing values are represented consistently (e.G., As blanks or as a specific sentinel value).

ETL stands for Extract, Transform, Load. It describes the three‑stage pipeline that moves data from source systems to a target analytical environment. Extraction pulls data from one or more sources; transformation cleans, reshapes, and enriches the data; loading writes the transformed data to the destination. In SAS, ETL can be orchestrated using batch scripts, SAS stored processes, or workflow tools such as SAS Data Integration Studio. A typical ETL flow might extract sales transactions from an Oracle database, standardize product codes, calculate revenue per region, and then load the aggregated results into a SAS data set used for reporting.

Data Integration expands on ETL by emphasizing the combination of disparate data sources into a unified view. Integration often involves joining tables from different systems, reconciling differing naming conventions, and resolving duplicate records. In SAS, integration can be performed with PROC SQL joins, the DATA step MERGE statement, or the MATCH MERGE technique that aligns observations by key variables.

Primary Key is a variable or set of variables that uniquely identifies each observation within a data set. Primary keys are essential for reliable joins and for preventing duplicate records. For example, an OrderID column in an orders table serves as a primary key. In SAS, primary keys are not enforced automatically, so the programmer must ensure uniqueness through validation steps such as PROC SORT with the NODUPKEY option.

Foreign Key is a variable that references the primary key of another data set, establishing a relationship between the two. For instance, a CustomerID in the orders data set is a foreign key that links each order to a customer record. Maintaining referential integrity requires that every foreign key value exists in the referenced primary key data set; this can be checked with PROC SQL NOT EXISTS queries.

Join operations combine two data sets based on matching key values. The most common join types are inner join, left join, right join, and full outer join. An inner join returns only rows where keys match in both tables; a left join returns all rows from the left table and matching rows from the right; a right join does the opposite; a full outer join returns all rows from both tables, inserting missing values where no match exists. In SAS, joins are expressed with PROC SQL:

``` Proc sql; Create table work.Sales_customers as Select a.*, B.Name, b.Segment From work.Sales a Left join work.Customers b On a.CustomerID = b.CustomerID; Quit; ```

Here, the left join ensures that every sales record appears, even if the corresponding customer information is missing.

Concatenation (also called stacking) appends observations from one data set to another that has the same variables. In SAS, the SET statement performs concatenation:

``` Data work.All_sales; Set work.Sales_q1 work.Sales_q2 work.Sales_q3 work.Sales_q4; Run; ```

All quarterly sales data sets share the same structure, so they can be combined into a single annual data set.

Data Profiling is the systematic examination of data to understand its content, structure, and quality. Profiling answers questions such as: What are the distinct values for a variable? What is the distribution of numeric fields? Are there unexpected missing values? SAS provides several procedures for profiling: PROC FREQ for categorical frequency, PROC MEANS for numeric summary, PROC UNIVARIATE for detailed distribution analysis, and PROC CONTENTS for structural metadata. A profiling workflow might look like this:

``` Proc freq data=work.Customers; Tables Gender*Region / missing; Run;

Proc means data=work.Customers n nmiss mean std min max; Var Age Salary; Run; ```

The output helps analysts detect anomalies such as a gender field containing values other than “M” or “F”, or an age variable with negative numbers.

Data Quality encompasses dimensions such as completeness, accuracy, consistency, timeliness, uniqueness, and validity. Completeness measures whether required fields are populated; accuracy assesses how close values are to the true world; consistency checks that related fields do not conflict; timeliness evaluates whether data is current; uniqueness ensures no duplicate records; validity confirms that values conform to defined rules (e.G., A zip code matches a known pattern). Data quality assessments are often built into data preparation pipelines using validation rules and exception handling logic.

Missing Values in SAS are represented by a period (.) For numeric variables and a blank for character variables. Missing values can arise from non‑responses, data entry errors, or source system limitations. SAS treats missing values as the smallest possible numeric value in sorting, which can affect calculations if not handled explicitly. Strategies for dealing with missing data include:

- Deleting observations with missing values (if the proportion is small and the loss is acceptable). - Imputing missing values with the mean, median, or mode of the variable. - Using predictive modeling to estimate missing values. - Flagging missing values for downstream analysis.

PROC STDIZE provides a simple way to replace missing numeric values with the mean or median:

``` Proc stdize data=work.Sales out=work.Sales_imputed method=mean reponly; Var Quantity Price; Run; ```

The REPOONLY option replaces only missing values, leaving existing data untouched.

Outlier Detection identifies observations that deviate markedly from the majority of the data. Outliers can indicate data entry errors, measurement anomalies, or genuine extreme cases. Techniques for outlier detection include visual inspection with box plots, statistical tests such as the interquartile range rule, or robust methods like the median absolute deviation. In SAS, PROC UNIVARIATE can generate box plots and calculate the 1.5*IQR thresholds:

``` Proc univariate data=work.Sales; Var Revenue; Histogram Revenue / normal; Inset mean std / position=ne; Run; ```

If outliers are confirmed as errors, they may be corrected or removed; if they are legitimate extreme values, analysts may choose to Winsorize them (cap at a defined percentile) to reduce their influence on statistical models.

Data Transformation refers to any operation that changes the shape, scale, or representation of data. Common transformations include:

- Recoding categorical values (e.G., Mapping “Yes”, “Y”, “1” to a single code). - Standardization (z‑score) to center numeric variables around zero with unit variance. - Normalization to rescale variables to a bounded range, such as 0–1. - Aggregation to summarize data at a higher level (e.G., Daily sales to monthly totals). - Pivoting (wide to long) or transposition (long to wide) to reshape data for modeling.

SAS provides a variety of tools for transformation. The DATA step can recode values using IF‑THEN logic:

``` Data work.Cleaned; Set work.Raw; If Gender in ('M','Male') then Gender_Code = 1; Else if Gender in ('F','Female') then Gender_Code = 0; Else Gender_Code = .; Run; ```

PROC STANDARD can standardize numeric variables:

``` Proc standard data=work.Sales mean=0 std=1 out=work.Sales_std; Var Revenue; Run; ```

PROC TRANSPOSE pivots data:

``` Proc transpose data=work.Monthly_sales out=work.Wide_sales prefix=Month_; By ProductID; Id Month; Var Sales; Run; ```

Standardization (also called z‑score scaling) transforms a variable by subtracting its mean and dividing by its standard deviation. This makes the variable unit‑less and suitable for algorithms that assume comparable scales, such as k‑means clustering or principal component analysis. In SAS, PROC STDIZE with the METHOD=STD option accomplishes this:

``` Proc stdize data=work.Training out=work.Training_std method=std; Var Age Income; Run; ```

Normalization rescales a numeric variable to a specified range, often 0 to 1. Normalization is useful when the model is sensitive to absolute magnitudes, such as neural networks that use activation functions bounded between 0 and 1. A manual normalization can be performed in a DATA step:

``` Proc sql noprint; Select min(Price), max(Price) into :Min_price, :Max_price From work.Products; Quit;

Data work.Products_norm; Set work.Products; Price_Norm = (Price - &min_price) / (&max_price - &min_price); Run; ```

Imputation replaces missing values with estimated ones. Simple imputation methods include mean, median, or mode substitution. More sophisticated approaches use regression models, k‑nearest neighbors, or multiple imputation to preserve variability. SAS’s PROC MI implements multiple imputation, generating several complete data sets that reflect uncertainty about the missing values:

``` Proc mi data=work.Sales out=work.Sales_imp seed=12345; Var Quantity Price; Run; ```

Each resulting data set can be analyzed separately, and results are combined using PROC MIANALYZE to produce final estimates that incorporate imputation variance.

Data Validation checks that data conforms to business rules before it is used for analysis. Validation can be performed at load time (e.G., Using SAS Data Integration Studio’s validation rules) or within a SAS program using conditional logic. Typical validation rules include:

- Range checks (e.G., Age must be between 0 and 120). - Format checks (e.G., Date must follow YYYY‑MM‑DD). - Consistency checks (e.G., EndDate must be later than StartDate). - Referential integrity (foreign key values must exist in the parent table).

A simple validation routine might look like this:

``` Data work.Validated; Set work.Raw; If Age < 0 or Age > 120 then do; Put "ERROR: Invalid Age for ID=" ID; Delete; End; If notdigit(Phone) then do; Put "WARNING: Non‑numeric Phone for ID=" ID; End; Run; ```

The PUT statement writes messages to the SAS log, allowing data stewards to review and correct problematic records.

Data Cleaning is the broader process of detecting and correcting errors, inconsistencies, and redundancies. Cleaning tasks often involve:

- Removing duplicate records (PROC SORT with NODUPKEY). - Trimming leading and trailing blanks (TRIM, STRIP functions). - Converting case (UPCASE, LOWCASE). - Replacing illegal characters (TRANWRD). - Standardizing date formats (INPUT with appropriate informat).

Example of duplicate removal:

``` Proc sort data=work.Customers nodupkey out=work.Customers_unique; By CustomerID; Run; ```

Example of trimming and case conversion:

``` Data work.Cleaned_names; Set work.Raw; Name = strip(Name); Name = propcase(Name); Run; ```

The PROPCASE function capitalizes the first letter of each word, improving readability.

Data Type in SAS determines how a variable is stored and how operations are performed on it. The main types are numeric, character, date, datetime, and time. SAS date values are stored as the number of days since January 1, 1960; datetime values are stored as seconds since that same epoch. Formats and informats control how these internal values are displayed and read. For example, a date stored as 21916 (which corresponds to 2020‑02‑15) can be displayed as “15FEB2020” using the DATE9. Format:

``` Data work.Example; Input OrderDate :Date9.; Format OrderDate date9.; Datalines; 15FEB2020 ; Run; ```

Understanding the distinction between internal storage and external representation is crucial for accurate data manipulation, especially when converting between string dates and SAS date values.

Format is a display attribute that tells SAS how to present a variable’s value in output. Formats do not change the underlying data. Common formats include DATE9., DATETIME19., COMMA12.2 For numeric values with thousands separators, and $CHAR20. For character strings of a fixed length. Formats can be assigned in a DATA step or via PROC FORMAT for custom definitions.

Informat is the counterpart to a format; it tells SAS how to read raw data into an internal value. For example, the YYMMDD10. Informat reads a string like “2021‑12‑31” and converts it to a SAS date. When importing data, selecting the correct informat prevents misinterpretation of dates, times, and numeric values.

PROC FORMAT enables creation of user‑defined formats that map raw values to more meaningful labels. For instance, a status code of 1, 2, 3 can be displayed as “Active”, “Inactive”, “Pending”. This improves readability of reports and reduces the need for repetitive CASE statements. Example:

``` Proc format; Value statusfmt 1 = 'Active' 2 = 'Inactive' 3 = 'Pending'; Run;

Data work.Customers; Set work.Customers_raw; Format Status statusfmt.; Run; ```

Indexing creates a data structure that speeds up data retrieval based on a key variable. In SAS, indexes can be defined with the INDEX= option in PROC SORT or with PROC DATASETS. Indexes are especially valuable for large data sets where frequent subsetting on the indexed variable occurs. However, indexes increase storage requirements and may slow down data loading, so they should be used judiciously.

``` Proc datasets lib=work nolist; Modify sales; Index create CustomerID; Quit; ```

After the index is created, a WHERE clause that filters on CustomerID will be executed more efficiently.

Sorting arranges observations in a specified order, typically required before merging data sets or before performing certain statistical procedures. PROC SORT is the primary SAS procedure for sorting. The BY statement defines one or more variables that determine the sort order. The NODUPKEY option removes duplicate observations based on the BY variables.

``` Proc sort data=work.Sales out=work.Sales_sorted; By CustomerID OrderDate; Run; ```

Sorting can be a performance bottleneck for very large data sets; SAS provides options such as TAGSORT (which uses a temporary tag file) and MEMSIZE tuning to mitigate this.

Data Compression reduces the physical size of SAS data sets on disk. SAS supports two main compression methods: CHAR (character‑based) and BINARY (binary‑based). Compression can be enabled at the data set level with the COMPRESS= option in a DATA step or at the library level with the COMPRESS= option in the LIBNAME statement. While compression saves storage and can improve I/O performance for read‑heavy workloads, it may increase CPU usage during write operations.

``` Libname mydata '/folders/myfolders/project_data' compress=binary; ```

Data Sampling creates a subset of the data for exploratory analysis or model development. Sampling techniques include simple random sampling, stratified sampling (where the sample preserves the proportion of categories), and systematic sampling. In SAS, PROC SURVEYSELECT provides a flexible framework for sampling:

``` Proc surveyselect data=work.Sales out=work.Sample method=srs Sampsize=0.1 Seed=98765; Run; ```

The SAMPSIZE=0.1 Argument selects 10 % of the observations randomly. For stratified sampling, the STRATA statement can be added to ensure each segment is represented proportionally.

Data Partitioning splits a data set into training, testing, and validation subsets for predictive modeling. Common splits are 70 % training, 15 % testing, and 15 % validation. Partitioning can be performed using random numbers and the WHERE clause, or with PROC SURVEYSELECT using the OUTALL option to retain the entire data set with a selection flag.

``` Proc surveyselect data=work.Full_data out=work.Partitioned seed=12345 Samprate=0.7 Outall; Run;

Data work.Train work.Test; Set work.Partitioned; If Selected = 1 then output work.Train; Else output work.Test; Run; ```

The variable Selected indicates whether an observation was chosen for the training sample. The remaining observations become the test set.

Data Anonymization removes or masks personally identifiable information (PII) to protect privacy while retaining analytical value. Techniques include masking (replacing values with asterisks), hashing (creating a non‑reversible identifier), and data perturbation (adding random noise). In SAS, the HASH object can generate consistent hashed IDs for linking records across tables without exposing original identifiers.

``` Data work.Anonymized; Set work.Customers; Length HashID $32.; Rc = hashobj.Add(key: CustomerID, data: HashID); If rc = 0 then HashID = hashobj.Hash(); Drop CustomerID; Run; ```

Anonymization is a critical step when handling health data, financial records, or any dataset subject to regulations such as GDPR or HIPAA.

Data Governance encompasses policies, standards, and processes that ensure data is managed as a strategic asset. Core governance roles include:

- Data Owner: The business individual accountable for data quality and usage. - Data Steward: The operational person who maintains data definitions, metadata, and resolves quality issues. - Data Custodian: The technical staff responsible for storage, security, and backup.

In a SAS environment, governance is supported by tools such as SAS Metadata Server, which stores information about libraries, users, and access permissions. Proper governance helps prevent “data silos”, ensures compliance, and supports reproducibility of analytical results.

Data Lineage tracks the flow of data from source to destination, documenting each transformation step. Lineage diagrams are valuable for impact analysis, debugging, and audit trails. SAS Data Integration Studio automatically captures lineage metadata when designing ETL jobs, allowing analysts to trace back any field to its original source.

Data Backup and Archiving protect against loss and provide historical snapshots. Backup copies are typically stored on separate media or cloud storage and refreshed on a scheduled basis. Archiving moves older, rarely accessed data to cheaper storage tiers while preserving the ability to retrieve it if needed. SAS data sets can be copied with PROC COPY or by simply using the X command to invoke operating‑system copy utilities.

``` Proc copy in=work out=archive; Select historic_sales; Run; ```

Data Versioning records changes to data sets over time, enabling rollback to previous states and supporting reproducible research. Versioning can be implemented manually by appending a date or version suffix to data set names, or automatically using SAS’s built‑in revision control features in SAS Enterprise Guide. A practical versioning scheme might be:

``` %Let today = %sysfunc(today(), yymmddn8.); Data work.Sales_&today.; Set work.Sales_latest; Run; ```

Each run creates a new data set with the current date, preserving a chronological history of the sales data.

Data Scheduling and Orchestration automate the execution of data pipelines. SAS can schedule jobs using the SAS Management Console, the command‑line scheduler (cron on Unix, Task Scheduler on Windows), or third‑party workflow tools such as Apache Airflow. Scheduling ensures that data extraction occurs after source system updates, that transformations run on off‑peak hours for performance reasons, and that results are delivered to downstream users on a predictable timetable.

Batch Processing runs a series of SAS programs without interactive intervention. Batch jobs are commonly used for overnight data loads, large‑scale transformations, or generating scheduled reports. A batch command on a Windows system might look like:

``` Sas -sysin "C:\SAS\jobs\load_sales.Sas" -log "C:\SAS\logs\load_sales.Log" ```

Batch processing enables efficient use of computing resources and reduces the risk of manual errors.

Real‑Time Processing handles data as it arrives, providing immediate analysis or alerts. While SAS is traditionally batch‑oriented, real‑time capabilities can be achieved through SAS Event Stream Processing (ESP) or by integrating SAS with streaming platforms such as Apache Kafka. Real‑time pipelines must address latency, ordering, and fault tolerance, and they often require lightweight transformations to maintain throughput.

Data Governance Frameworks such as DAMA‑DMBoK (Data Management Body of Knowledge) provide structured approaches to define data policies, data quality metrics, and stewardship responsibilities. In a SAS‑centric curriculum, students learn how to map these frameworks to SAS tools: Metadata management in SAS Metadata Server, data quality monitoring with SAS Data Quality Server, and access control via SAS Object Spawner.

Data Quality Rules are specific checks that enforce the dimensions of quality. Examples include:

- Completeness Rule: “CustomerID must not be missing.” - Consistency Rule: “If OrderStatus = ‘Cancelled’, then CancelDate must be non‑missing.” - Validity Rule: “PostalCode must match the regex ^[0-9]{5}(-[0-9]{4})?$.” - Uniqueness Rule: “InvoiceNumber must be unique within the fiscal year.”

These rules can be codified in SAS using macro variables and applied across multiple data sets, providing a reusable quality framework.

Data Profiling Tools such as PROC CONTENTS, PROC FREQ, PROC MEANS, and PROC UNIVARIATE each serve a specific purpose. PROC CONTENTS provides a structural overview, including variable names, types, lengths, and labels. PROC FREQ is ideal for categorical variables, revealing frequency counts and detecting unexpected categories. PROC MEANS summarizes numeric variables, delivering count, mean, standard deviation, min, and max. PROC UNIVARIATE offers detailed distribution statistics, including skewness, kurtosis, and quantiles, as well as visualizations like histograms and normal probability plots.

Data Cleaning Functions in SAS extend beyond simple IF‑THEN logic. PROC STDIZE can standardize and impute missing values simultaneously. PROC TRANSPOSE reshapes data, which is often necessary when dealing with wide‑format survey results that need to be converted to long format for analysis. PROC SQL’s CASE expression provides a concise way to recode values within a query.

``` Proc sql; Create table work.Recode_status as Select *, Case When Status = 'A' then 'Active' When Status = 'I' then 'Inactive' Else 'Unknown' End as Status_Desc From work.Raw_status; Quit; ```

Data Preparation Steps typically follow a logical sequence: Extraction, profiling, cleansing, transformation, and loading. Each step builds on the previous one, and errors early in the pipeline can propagate downstream, leading to misleading analytical results. Therefore, robust error handling and logging are essential. SAS provides automatic error checking through the SYSLAST macro variable, and custom logging can be added with the FILE statement to write messages to a dedicated log file.

``` Filename prep_log '/folders/myfolders/prep.Log'; Proc printto log=prep_log new; Run;

/* Data preparation code goes here */

Proc printto; Run; ```

SAS Macros automate repetitive tasks, enforce naming conventions, and centralize logic. A macro that standardizes date variables across multiple tables might look like this:

``` %Macro std_date(lib=, ds=, var=); Data &lib..&Ds._Std; Set &lib..&Ds; &Var._Date = input(&var, yymmdd10.); Format &var._Date date9.; Drop &var; Run; %Mend std_date;

%Std_date(lib=work, ds=sales, var=OrderDate); %Std_date(lib=work, ds=returns, var=ReturnDate); ```

By encapsulating the conversion logic, the macro ensures consistency and reduces the chance of errors.

Performance Tuning becomes critical when managing large data volumes. Techniques include:

- Reducing the size of character variables with appropriate LENGTH statements. - Using WHERE clauses early to limit the number of rows processed. - Applying indexes on key variables before merging. - Leveraging the MEMSIZE and SORTSIZE options to allocate sufficient memory. - Using the COMPRESS= option judiciously to balance I/O speed and CPU overhead.

Example of an early filter:

``` Data work.High_value; Set work.Sales; Where Revenue > 10000; Run; ```

Only rows that meet the condition are read into memory, minimizing unnecessary processing.

Common Challenges in data management and preparation include:

1. Inconsistent Formats: Dates may appear as “2021‑01‑31”, “31/01/2021”, or “Jan‑31‑2021”. Resolving this requires identifying all patterns and applying the appropriate informat or using the ANYDTDTE. Informat, which attempts to parse a variety of date representations.

2. Mixed Data Types: A column intended to be numeric may contain non‑numeric characters, causing SAS to treat the entire variable as character. The solution involves cleaning the column, removing non‑numeric characters with the COMPRESS function, and then converting to numeric with INPUT.

3. Large Volume Processing: When data sets exceed available memory, SAS can spill to disk, but this slows performance. Partitioning the data, processing in chunks, or using SAS Viya’s distributed processing engine can alleviate the bottleneck.

4. Duplicate Records: Duplicate keys may arise from upstream systems that do not enforce uniqueness. Detecting duplicates with PROC SORT NODUPKEY and then deciding whether to keep the first, last, or aggregate duplicate rows is a key decision.

5. Missing Data Patterns: Missingness may be systematic (e.G., A field is missing for a specific region) or random. Understanding the pattern influences the choice of imputation method; for systematic missingness, domain knowledge may suggest a default value or a derived estimate.

6. Data Security and Privacy: Sensitive data must be protected through encryption, access controls, and anonymization. SAS provides data encryption options at the library level and integrates with enterprise security systems for authentication.

7. Version Control: Without proper versioning, analysts may overwrite valuable historical data. Incorporating date stamps into data set names, maintaining a change log, and using source control for SAS code (e.G., Git) are best practices.

8. Documentation Deficit: Poorly documented data pipelines lead to knowledge loss. Maintaining up‑to‑date data dictionaries, flow diagrams, and inline comments in SAS programs mitigates this risk.

Addressing these challenges requires a blend of technical skill, systematic processes, and clear communication among data engineers, analysts, and business stakeholders.

Practical Application Example

Consider a retail company that needs to prepare a data set for a churn prediction model. The raw data resides in three sources:

- An Oracle database containing transaction history (sales table). - A CSV file with customer demographics (customers.Csv). - A web service that provides loyalty program scores (accessed via a REST API).

The preparation steps are:

1. **Extraction**: Use SAS/ACCESS to pull the sales table, PROC IMPORT to read the CSV, and PROC HTTP to call the web service, storing the JSON response in a temporary file.

2. **Profiling**: Run PROC CONTENTS, PROC FREQ, and PROC MEANS on each extracted data set to identify data types, missing values, and outlier patterns.

3. **Cleaning**: Apply the following actions: - Trim and standardize customer names. - Convert all date strings to SAS date values using the ANYDTDTE. Informat. - Impute missing income values with the median income per region. - Remove duplicate transaction records based on TransactionID.

4. **Transformation**: - Create a flag for high‑value customers (total spend > $5,000). - Calculate recency, frequency, and monetary (RFM) metrics using PROC SQL aggregation. - Encode categorical variables (e.G., “PreferredChannel”) using PROC FORMAT.

5. **Loading**: Merge the cleaned transaction data with the demographic data on CustomerID, and then join the loyalty scores using a left join. Finally, write the resulting analytical data set to a compressed SAS library for downstream modeling.

A concise SAS implementation illustrating these steps is shown below:

``` /* 1. Extraction */ Libname oradb oracle user=retail_user password=***** path='sales_db'; Proc sql; Create table work.Sales_raw as Select * From oradb.Sales; Quit;

Proc import datafile='/folders/myfolders/customers.Csv' out=work.Customers_raw Dbms=csv replace; Getnames=yes; Run;

Filename resp temp; Proc http url='https://Api.Loyalty.Com/scores?Date=today' method='GET' out=resp; Run; Libname json json fileref=resp; Proc copy in=json out=work.Loyalty_raw; Run;

/* 2. Profiling */ Proc contents data=work.Sales_raw; run; Proc freq data=work.Customers_raw; tables Gender Region / missing; run; Proc means data=work.Sales_raw n nmiss mean std min max; var Amount; run;

/* 3. Cleaning */ Data work.Customers_clean; Set work.Customers_raw; Name = propcase(strip(Name)); BirthDate = input(BirthDate, anydtdte10.); Format BirthDate date9.; If Income = . Then Income = median(Income) over (partition by Region); Run;

Proc sort data=work.Sales_raw nodupkey out=work.Sales_clean; By TransactionID; Run;

/* 4. Transformation */ Proc sql; Create table work.Rfm as Select CustomerID, Max(OrderDate) as LastPurchase format=date9., Count(*) as Frequency, Sum(Amount) as Monetary format=comma12.2 From work.Sales_clean Group by CustomerID; Quit;

Proc format; Value $channelfmt 'Online' = '1' 'Store' = '2' Other = '0'; Run;

Data work.Rfm_transformed; Set work.

Key takeaways

  • In the Certificate in Data Analytics with SAS, learners encounter a wide range of terminology that describes how raw information is turned into a reliable, analysis‑ready asset.
  • For example, a CSV file located on a network share can be read with a DATA step that uses the INFILE statement, while a SQL Server table can be accessed using PROC SQL together with the SAS/ACCESS Interface to ODBC.
  • A library is created with the LIBNAME statement, and it can refer to a directory on the file system, a database, or a remote server.
  • The library name (mydata) becomes a prefix that allows the programmer to reference any data set within that location, such as mydata.
  • The term observation is interchangeable with record or case, and the term variable aligns with field or column.
  • SAS stores numeric values in floating‑point format, while character values are stored as strings of a defined length.
  • In this example, CustomerID and Name are character variables, while Age and Salary are numeric.
June 2026 intake · open enrolment
from £90 GBP
Enrol