Data Exploration and Visualization

Data set – In SAS a data set is a collection of rows, called observations, and columns, called variables. Each observation represents a single case, such as a customer, a transaction, or a measurement taken at a specific time. Variables hold the attributes of the observation, for example age, salary, or purchase_date. Understanding the structure of a data set is the first step in any exploration because it determines which analytical techniques are appropriate. For instance, a data set that contains only categorical variables will be examined with frequency tables and chi‑square tests, whereas a data set that contains many numeric variables may be explored with correlation matrices and scatter plots.

Variable – A variable in SAS is analogous to a column in a spreadsheet. Variables can be numeric or character. Numeric variables store numbers and may be used directly in arithmetic calculations, statistical tests, and graphical axes. Character variables store text strings and are often used for identifiers, categories, or dates that need to be parsed. When no length is declared, SAS assigns a length to a character variable automatically; if that length is too short, values can be truncated, leading to data quality problems. A common challenge is to identify variables that have been mistakenly imported as character when they should be numeric, such as an order amount that contains only digits but is stored as text. (Identifiers such as zip codes are usually best left as character so that leading zeros are preserved.)
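As a minimal sketch of the character-to-numeric repair described above, the following DATA steps build a small hypothetical table (work.raw_orders, order_id, and amount_txt are invented for illustration) and convert the text column with the INPUT function. The ?? modifier is used here so that unparseable text becomes a missing value without flooding the log.

```sas
/* Hypothetical input: amounts that arrived as text */
data work.raw_orders;
    input order_id $ amount_txt $;
    datalines;
A1 120.50
A2 75
A3 n/a
;
run;

data work.orders;
    set work.raw_orders;
    /* ?? suppresses invalid-data notes; text that cannot be read becomes missing */
    amount = input(amount_txt, ?? best32.);
run;

/* Confirm that amount is numeric and amount_txt is character */
proc contents data=work.orders;
run;
```

Keeping the original text column next to the converted one makes it easy to review the rows where the conversion produced a missing value.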

Observation – An observation is a single row in a SAS data set. It is sometimes called a case, record, or unit of analysis. Each observation contains a value for every variable in the data set, though some values may be missing. In practice, analysts frequently filter observations based on criteria such as date ranges, geographic regions, or customer segments to reduce the size of the data set and focus the exploration on relevant subsets.

Missing value – In SAS a missing numeric value is represented by a period (.) and a missing character value by a blank string. Missing values can arise from non‑responses in surveys, data entry errors, or failures in data extraction processes. Detecting the pattern of missingness is essential because it influences the choice of imputation methods. For example, if a variable has less than 5% missing and the missingness appears random, simple mean imputation may be acceptable; however, if the missingness is systematic (e.g., higher‑income respondents are more likely to skip a question), more sophisticated techniques such as multiple imputation or model‑based approaches are required.
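One way to quantify missingness is sketched below, assuming the SASHELP.HEART sample data set that ships with SAS is available. PROC MEANS counts missing numeric values, and a user-defined format (the name $missfmt is an arbitrary choice) collapses a character variable into "Missing" versus "Present" for PROC FREQ.

```sas
/* Count non-missing (N) and missing (NMISS) values for numeric variables */
proc means data=sashelp.heart n nmiss;
    var cholesterol weight height;
run;

/* Collapse a character variable into missing vs. present */
proc format;
    value $missfmt ' ' = 'Missing'
                   other = 'Present';
run;

proc freq data=sashelp.heart;
    format deathcause $missfmt.;
    tables deathcause / missing;
run;
```

The MISSING option makes PROC FREQ treat missing values as a regular category, so the percentage of missing records appears directly in the table.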

Data cleaning – Data cleaning is the systematic process of detecting and correcting (or removing) inaccurate records from a data set. Typical tasks include removing duplicate observations, correcting inconsistent coding (e.g., “NY”, “N.Y.”, and “New York” for the same state), handling out‑of‑range values, and standardizing date formats. SAS provides a variety of functions, such as COMPRESS, SUBSTR, and INPUT, that are useful for cleaning character variables, while numeric variables can be validated using logical expressions. The challenge in data cleaning is to balance thoroughness with the risk of inadvertently discarding valid data.
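The inconsistent state coding mentioned above could be standardized along the following lines. This is only a sketch: the data set and variable names (work.states_raw, state_raw, key, state) are hypothetical, and a real project would more likely use a lookup table than a hard-coded IN list.

```sas
/* Hypothetical raw values with inconsistent coding */
data work.states_raw;
    length state_raw $ 12;
    input state_raw $ 1-12;
    datalines;
NY
N.Y.
New York
ny
;
run;

data work.states_clean;
    set work.states_raw;
    length state $ 2;
    /* Remove periods and blanks, then uppercase, so all variants share one key */
    key = upcase(compress(state_raw, '. '));
    if key in ('NY', 'NEWYORK') then state = 'NY';
run;

proc print data=work.states_clean;
run;
```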

Outlier – An outlier is an observation that deviates markedly from the majority of the data. Outliers can be legitimate extreme values, data entry errors, or measurement anomalies. Identifying outliers often begins with visual tools such as boxplots, which display the interquartile range and flag points beyond the whiskers as potential outliers. Statistical rules, such as flagging values that lie more than 1.5 × IQR above the third quartile or below the first quartile, provide a formal definition, but analysts must also consider domain knowledge. In some cases, outliers are removed before modeling to improve model fit; in other situations, they are retained because they carry important information (e.g., fraud detection).
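The 1.5 × IQR rule can be applied in code roughly as follows. The sketch assumes the SASHELP.CARS sample data set and its MSRP variable; the names work.q, work.flagged, iqr, and outlier are illustrative choices.

```sas
/* Step 1: compute the first and third quartiles */
proc means data=sashelp.cars noprint;
    var msrp;
    output out=work.q q1=q1 q3=q3;
run;

/* Step 2: attach the quartiles to every row and flag values outside the fences */
data work.flagged;
    if _n_ = 1 then set work.q(keep=q1 q3);
    set sashelp.cars;
    iqr = q3 - q1;
    outlier = (msrp < q1 - 1.5*iqr) or (msrp > q3 + 1.5*iqr);
run;

/* Step 3: count flagged observations */
proc freq data=work.flagged;
    tables outlier;
run;
```

Flagging rather than deleting keeps the decision reversible, which matters when domain knowledge later shows that the extreme values are legitimate.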

Descriptive statistics – Descriptive statistics summarize the central tendency, dispersion, and shape of a variable’s distribution. Common measures include the mean, median, mode, variance, standard deviation, range, and interquartile range. SAS procedures such as PROC MEANS and PROC UNIVARIATE generate these statistics automatically. For example, PROC MEANS can be instructed to compute the mean, standard deviation, minimum, and maximum for every numeric variable in a data set, while PROC UNIVARIATE provides additional measures of skewness and kurtosis. Descriptive statistics are the foundation for more advanced analyses because they reveal potential data quality issues (e.g., an unusually high standard deviation may indicate a data entry error).

Frequency distribution – A frequency distribution tabulates the count (and optionally the percentage) of each distinct value of a categorical variable. In SAS, PROC FREQ is the primary tool for creating frequency tables. For example, PROC FREQ can produce a table that shows how many customers belong to each loyalty tier (Bronze, Silver, Gold). Frequency distributions help analysts understand the composition of categorical variables, detect rare categories that may need to be combined, and assess whether the data set is balanced for modeling purposes.

Histogram – A histogram visualizes the distribution of a numeric variable by grouping values into bins and displaying the count or density of observations in each bin. SAS creates histograms using PROC SGPLOT with the HISTOGRAM statement, or via PROC UNIVARIATE with its own HISTOGRAM statement. The choice of bin width influences the appearance of the histogram: too few bins may obscure important features, while too many bins can create a noisy plot. Analysts often experiment with different binning strategies to reveal patterns such as multimodality, skewness, or gaps.

Boxplot – A boxplot (or box‑and‑whisker plot) displays the five‑number summary of a numeric variable: minimum, first quartile, median, third quartile, and maximum. Outliers are typically plotted as individual points beyond the whiskers. In SAS, PROC SGPLOT with the VBOX or HBOX statement creates vertical or horizontal boxplots, respectively. Boxplots are especially useful for comparing distributions across groups; for example, a boxplot of monthly sales by region can quickly highlight which region has the highest median sales and which regions exhibit the most variability.
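A grouped boxplot of this kind might look like the sketch below, which assumes the SASHELP.CARS sample data set and compares list prices (MSRP) across the ORIGIN categories in place of the sales-by-region example.

```sas
/* One vertical box per category of ORIGIN */
proc sgplot data=sashelp.cars;
    vbox msrp / category=origin;
run;
```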

Scatter plot – A scatter plot displays the relationship between two numeric variables by plotting each observation as a point on a two‑dimensional grid. SAS generates scatter plots using PROC SGPLOT with the SCATTER statement. Adding a regression line (via the REG statement) or a smoothing curve (via the LOESS statement) helps to visualize the underlying trend. Scatter plots are fundamental for detecting linear or nonlinear associations, clusters, and potential outliers. When the data set contains a categorical grouping variable, the GROUP= option can color‑code points, allowing analysts to see how groups differ in the bivariate relationship.
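The combination of SCATTER, REG, and GROUP= described above can be sketched with the SASHELP.CLASS sample data set (assumed to be available). The NOMARKERS option on REG avoids drawing the points a second time.

```sas
proc sgplot data=sashelp.class;
    /* Points colored by the grouping variable */
    scatter x=height y=weight / group=sex;
    /* Overall fitted regression line, without re-plotting markers */
    reg x=height y=weight / nomarkers;
run;
```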

Line chart – A line chart connects points representing a numeric variable over an ordered dimension, such as time. In SAS, PROC SGPLOT with the SERIES statement produces line charts. Line charts are ideal for visualizing trends, seasonality, and abrupt changes. For example, a line chart of daily website visits over a year can reveal peaks during promotional campaigns and troughs during holidays. Adding multiple series to the same chart (using the GROUP= option) enables comparison of trends across categories, such as sales by product line.
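As a small illustration, the following sketch plots the monthly airline passenger series in SASHELP.AIR (assumed to be available), which shows both trend and seasonality.

```sas
/* DATE is a SAS date variable; AIR is the monthly passenger count */
proc sgplot data=sashelp.air;
    series x=date y=air;
run;
```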

Bar chart – A bar chart displays the frequency or aggregate value of categorical variables using rectangular bars. SAS creates bar charts with PROC SGPLOT and the VBAR or HBAR statements. Bars can represent raw counts, percentages, or summary statistics such as the mean of a numeric variable within each category. For instance, a vertical bar chart of average order value by customer segment immediately shows which segment spends the most per order. Grouped or stacked bar charts allow analysts to compare multiple dimensions simultaneously, but careful design is needed to avoid visual clutter.

Pie chart – A pie chart partitions a circle into slices that represent the proportion of each category relative to the whole. Although SAS can produce pie charts using PROC GCHART, many data‑visualization best practices discourage their use because it is difficult for the human eye to compare angles accurately. When a pie chart is employed, it should be limited to a small number of categories (typically fewer than five) and used only when the goal is to emphasize the share of a single dominant category.

Heat map – A heat map uses color intensity to represent the magnitude of a numeric variable across two dimensions, often a matrix of categories. SAS’s PROC SGPLOT with the HEATMAP statement produces heat maps. A common application is a correlation heat map, where each cell shows the correlation coefficient between two variables and the color reflects the strength and direction of the relationship. Heat maps are powerful for spotting patterns, such as groups of variables that move together, but they require a well‑chosen color palette to avoid misinterpretation.

Correlation – Correlation quantifies the strength and direction of a linear relationship between two numeric variables. The most widely used metric is the Pearson correlation coefficient, which ranges from –1 (perfect negative linear relationship) to +1 (perfect positive linear relationship). SAS computes Pearson correlations with PROC CORR, which also provides p‑values for testing the null hypothesis of zero correlation. When variables are ordinal or non‑normally distributed, the Spearman rank correlation may be more appropriate; PROC CORR can calculate Spearman coefficients by specifying the SPEARMAN option. Understanding correlation is essential before building regression models because highly correlated predictors can cause multicollinearity, inflating standard errors and destabilizing coefficient estimates.
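Both coefficients can be requested in one call, as in this sketch on the SASHELP.CARS sample data set (assumed to be available). Comparing the two tables is a quick check for monotone but nonlinear relationships.

```sas
/* Pearson and Spearman correlation matrices with p-values */
proc corr data=sashelp.cars pearson spearman;
    var msrp horsepower mpg_city;
run;
```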

Covariance – Covariance measures the joint variability of two numeric variables. Unlike correlation, covariance retains the original units of measurement, making it less interpretable when variables have different scales. In SAS, PROC CORR with the COV option outputs the covariance matrix. Covariance is a building block for many multivariate techniques, including principal component analysis (PCA) and multivariate normal modeling.

Regression – Regression models describe the relationship between a dependent (response) variable and one or more independent (predictor) variables. The simplest form is simple linear regression, where a single predictor explains the variation in a continuous response. SAS implements regression through PROC REG for ordinary least‑squares models, and PROC GLM for general linear models that can handle categorical predictors via class variables. Regression analysis provides estimates of coefficients, confidence intervals, hypothesis tests, and diagnostics such as residual plots. Practical applications include predicting sales based on advertising spend, estimating the effect of temperature on equipment failure rates, and modeling the relationship between education level and income.

Logistic regression – Logistic regression is used when the response variable is binary (e.g., churn = 1/0). Instead of modeling the response directly, logistic regression models the log‑odds of the event occurring. SAS’s PROC LOGISTIC fits logistic models and supplies odds ratios, model‑fit statistics (e.g., AIC, Hosmer‑Lemeshow test), and classification tables. A common challenge is handling imbalanced data where the event of interest is rare; techniques such as oversampling, undersampling, or applying a cost‑sensitive loss function can improve predictive performance.
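A minimal sketch of such a model is shown below. It uses the SASHELP.HEART sample data set (assumed to be available) with STATUS as a stand-in for a churn flag; the choice of predictors is arbitrary and only for illustration.

```sas
proc logistic data=sashelp.heart;
    /* Reference-cell coding for the categorical predictor */
    class sex / param=ref;
    /* Model the probability that STATUS equals 'Dead';
       LACKFIT requests the Hosmer-Lemeshow goodness-of-fit test */
    model status(event='Dead') = sex ageatstart cholesterol / lackfit;
run;
```

Naming the event level explicitly avoids the common mistake of modeling the wrong outcome category.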

Classification tree – A classification tree partitions the predictor space into rectangular regions by recursively splitting on variable thresholds. Each terminal node (leaf) assigns a class label based on the majority class of the observations it contains. SAS’s PROC TREESPLIT and PROC HPFOREST (for random forests) generate tree‑based models. Trees are attractive because they are easy to interpret, can handle mixed data types, and automatically capture nonlinear relationships. However, single trees are prone to overfitting; ensemble methods such as random forests or gradient boosting mitigate this risk by aggregating many trees.

Clustering – Clustering groups observations into subsets (clusters) such that objects within a cluster are more similar to each other than to objects in other clusters. SAS provides several clustering algorithms, including hierarchical clustering (PROC CLUSTER), k‑means clustering (PROC FASTCLUS), and density‑based clustering (PROC MODECLUS). The choice of algorithm depends on data size, shape of clusters, and the need for interpretability. A practical use case is market segmentation, where customers are clustered based on purchase history and demographics to create targeted marketing campaigns. Challenges include selecting the appropriate number of clusters and scaling variables so that no single variable dominates the distance calculations.

Principal component analysis – PCA reduces dimensionality by transforming a set of correlated variables into a smaller set of uncorrelated components that capture most of the variance. SAS’s PROC PRINCOMP computes eigenvalues and eigenvectors of the covariance or correlation matrix, producing component scores that can be used for visualization or as inputs to downstream models. PCA is valuable when dealing with high‑dimensional data such as sensor readings or questionnaire items, because it simplifies the data while preserving the essential structure. A common pitfall is interpreting components without examining the loadings; the loadings reveal which original variables contribute most to each component.
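The workflow of computing components and then plotting the scores can be sketched as follows, assuming the SASHELP.IRIS sample data set. By default PROC PRINCOMP analyzes the correlation matrix; the output name work.iris_scores is an arbitrary choice.

```sas
/* Keep the first two components and write the scores (Prin1, Prin2) to a data set */
proc princomp data=sashelp.iris out=work.iris_scores n=2;
    var sepallength sepalwidth petallength petalwidth;
run;

/* Plot the observations in component space, colored by species */
proc sgplot data=work.iris_scores;
    scatter x=prin1 y=prin2 / group=species;
run;
```

The eigenvector table printed by PROC PRINCOMP shows which original variables drive each component and should be read before interpreting the plot.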

Dimensionality reduction – Beyond PCA, dimensionality reduction encompasses techniques such as factor analysis, t‑distributed stochastic neighbor embedding (t‑SNE), and uniform manifold approximation and projection (UMAP). SAS’s PROC FACTOR performs factor analysis, while PROC TSNE (available in SAS Viya) implements t‑SNE. These methods are used to visualize high‑dimensional data in two or three dimensions, often revealing clusters or outliers that are not apparent in the original space. The challenge is that many dimensionality‑reduction algorithms are stochastic; repeated runs can produce different visualizations, so analysts must set random seeds for reproducibility.

Data transformation – Data transformation changes the scale or distribution of a variable to meet analytical assumptions or improve model performance. Common transformations include logarithmic, square‑root, and Box‑Cox transformations. SAS provides the LOG and SQRT functions for the first two, and PROC TRANSREG can estimate a Box‑Cox transformation. For example, a positively skewed income variable may be log‑transformed to approximate normality, which many parametric tests assume. Care must be taken to interpret model results on the transformed scale and, when presenting results to business stakeholders, to back‑transform predictions to the original units.

Aggregation – Aggregation summarizes data at a coarser level of granularity, such as computing total sales per month or average rating per product. SAS’s PROC SUMMARY and PROC MEANS with the CLASS statement perform aggregation efficiently. The results can be stored in a data set using the OUT= option of the OUTPUT statement, enabling further analysis or visualization. Aggregation is indispensable for creating dashboards that display key performance indicators (KPIs) at various hierarchical levels (e.g., company, region, store).

Pivot – Pivoting restructures data from a long format (one row per observation) to a wide format (one row per entity with multiple columns). SAS’s PROC TRANSPOSE accomplishes this operation. For instance, a data set of monthly sales records can be pivoted so that each store appears once with separate columns for each month’s sales. Pivoting is useful for preparing data for certain modeling techniques that require a fixed set of predictor columns. The main difficulty lies in handling missing values that arise when some entities lack data for particular time periods.

Cross‑tabulation – A cross‑tabulation (or contingency table) displays the joint frequency distribution of two categorical variables. PROC FREQ with the TABLES statement produces cross‑tabs, and the CHISQ option adds chi‑square statistics for testing independence. Cross‑tabs are frequently used in market research to explore relationships such as gender versus product preference. When the table becomes large (many categories on each axis), the cell counts may be sparse, violating chi‑square assumptions; in such cases, Fisher’s exact test (available via the FISHER option in the TABLES statement or via the EXACT statement) is a more reliable alternative.
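A sparse table of this kind can be reproduced with the small SASHELP.CLASS sample data set (19 rows, assumed to be available). The sketch requests both the chi-square test and Fisher's exact test so the two can be compared.

```sas
/* SEX by AGE gives a 2 x 6 table with very small cell counts */
proc freq data=sashelp.class;
    tables sex*age / chisq fisher;
run;
```

With counts this small, PROC FREQ warns that the chi-square test may not be valid, which is the situation in which the exact test is preferred. Exact tests can be slow on large tables, so they are best reserved for small samples.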

Data profiling – Data profiling is the systematic examination of data to assess its quality, structure, and content. SAS’s PROC CONTENTS provides metadata such as variable types, lengths, and labels, while PROC MEANS and PROC FREQ generate summary statistics that reveal anomalies. Data profiling helps answer questions like: “What proportion of records have missing values?”, “Are there unexpected negative ages?”, and “Do date fields follow a consistent format?”. The output of data profiling guides the subsequent cleaning and transformation steps.

Data governance – Data governance encompasses the policies, procedures, and standards that ensure data is accurate, secure, and used responsibly. In the context of data exploration, governance dictates who may access the data, how data lineage is documented, and what naming conventions must be followed. SAS Metadata Server stores information about data sets, libraries, and user permissions, enabling auditors to trace the origin of a data set back to its source system. A common governance challenge is balancing the need for rapid exploration with compliance requirements, especially when dealing with personally identifiable information (PII).

Data lineage – Data lineage tracks the flow of data from its original source through each transformation, aggregation, and analysis step. In SAS, the log records each DATA step and procedure that was executed, while SAS Enterprise Guide automatically records a process flow diagram. Maintaining clear lineage is crucial for reproducibility; if a model’s performance declines, analysts can revisit earlier steps to identify whether a change in data cleaning or variable creation introduced the issue.

Data storytelling – Data storytelling combines visualizations, narrative, and context to convey insights in a compelling way. Effective storytelling selects the right chart type for each message, uses annotations to highlight key points, and structures the presentation to guide the audience from problem definition through analysis to recommendation. SAS Visual Analytics supports interactive dashboards where users can drill down from high‑level KPIs to detailed tables. The main challenge is to avoid overwhelming the audience with too many visual elements while still providing enough depth for decision makers to act.

Dashboard – A dashboard is a collection of visual components—charts, tables, gauges, and filters—arranged on a single screen to provide an at‑a‑glance view of business performance. SAS Visual Analytics offers drag‑and‑drop dashboard creation, allowing analysts to embed dynamic elements such as date range selectors that automatically refresh all visualizations. Best practices for dashboards include using a limited color palette, aligning charts for visual consistency, and prioritizing the most important metrics at the top of the layout.

Interactive visualization – Interactive visualizations enable users to explore data by hovering, clicking, or selecting subsets. SAS Visual Analytics, SAS Studio, and SAS Enterprise Guide each support interactivity through ODS Graphics and JavaScript integration. For example, a heat map of sales by region can be linked to a detail table that updates when a user clicks on a specific region. Interactivity enhances discovery because analysts can test hypotheses on the fly without writing additional code.

ODS (Output Delivery System) – ODS is SAS’s mechanism for directing output to various destinations such as HTML, PDF, RTF, or Excel. ODS also controls graphic output via the ODS Graphics engine, which produces high‑quality visualizations. A typical workflow starts with ODS HTML to generate an interactive report for web viewing, then switches to ODS PDF for a printable version. Understanding ODS options such as STYLE=, IMAGEFMT=, and DEVICE= is essential for customizing the appearance of tables and graphs.

ODS Graphics – ODS Graphics is the component of ODS that creates statistical graphics. It replaces older SAS/GRAPH procedures and provides a unified syntax across many procedures. For instance, PROC SGPLOT, PROC SGPANEL, and PROC SGSCATTER all produce ODS Graphics output. The engine supports features such as regression lines, confidence bands, and annotation layers. A common pitfall is that default graph sizes may be too small for detailed dashboards; adjusting the WIDTH= and HEIGHT= options of the ODS GRAPHICS statement resolves this issue.
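Graph dimensions are set on the ODS GRAPHICS statement before the plotting procedure runs, as in this sketch. The dimensions are arbitrary, and SASHELP.CLASS is assumed to be available.

```sas
/* Enlarge subsequent graphs */
ods graphics / width=8in height=5in;

proc sgplot data=sashelp.class;
    scatter x=height y=weight;
run;

/* Restore the default settings for later graphs */
ods graphics / reset;
```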

SAS/STAT – SAS/STAT is the statistical analysis library that contains procedures for regression, survival analysis, mixed models, and more. The core exploration procedures PROC CORR, PROC UNIVARIATE, PROC FREQ, and PROC MEANS are part of Base SAS rather than SAS/STAT, so they are available without a SAS/STAT license. SAS/STAT adds advanced modeling tools such as PROC GLIMMIX for generalized linear mixed models and PROC SURVEYREG for complex survey designs. Knowing which procedures support ODS Graphics enables analysts to produce visual output directly from statistical analyses.

SAS/GRAPH – SAS/GRAPH is the legacy graphics library that includes procedures like PROC GCHART, PROC GPLOT, and PROC G3D. Although newer ODS Graphics procedures are preferred, SAS/GRAPH remains useful for specialized plots such as radar charts or custom annotation maps. When using SAS/GRAPH, analysts must manage the GOPTIONS statements to control aspects like device type and image resolution. Transitioning from SAS/GRAPH to ODS Graphics can improve consistency across reports.

SAS Visual Analytics – SAS Visual Analytics is a web‑based platform that provides self‑service data exploration, advanced analytics, and interactive reporting. It integrates with SAS Cloud Analytic Services (CAS) to handle large data volumes in memory. Key features include drag‑and‑drop visual authoring, auto‑generated insights (e.g., “Top 5 drivers of churn”), and the ability to embed predictive models built in SAS/STAT directly into dashboards. A practical application is creating a real‑time sales performance dashboard that updates as new transaction data streams into CAS.

SAS Enterprise Guide – SAS Enterprise Guide is a point‑and‑click interface for building SAS programs, running procedures, and creating visualizations. It records each step in a process flow, facilitating reproducibility and collaboration. For data exploration, analysts can quickly generate frequency tables, histograms, and scatter plots by selecting menu options, while the underlying SAS code is automatically generated and can be edited for customization. Enterprise Guide also integrates with ODS to export results to Excel or PowerPoint.

SAS Studio – SAS Studio is a browser‑based development environment that provides a code editor, task templates, and interactive output windows. It supports ODS Graphics and includes built‑in tasks for data profiling, data preparation, and visualization. SAS Studio is ideal for learners because it requires no client installation and offers a guided interface for common exploratory steps, such as “Explore Data” tasks that automatically produce histograms, boxplots, and correlation matrices.

PROC SGPLOT – PROC SGPLOT is the workhorse for creating modern statistical graphics in SAS. It supports a wide range of plot types (scatter, line, bar, histogram, boxplot, heat map, and more) through simple statements. Example:

```
proc sgplot data=work.sales;
    histogram revenue / nbins=20;
    density revenue / type=normal;
run;
```

This code produces a histogram of the variable revenue with a superimposed normal density curve. The flexibility of PROC SGPLOT allows analysts to layer multiple plot types (e.g., a scatter plot with a regression line) in a single figure, which is valuable for exploratory analysis.

PROC UNIVARIATE – PROC UNIVARIATE provides a comprehensive view of a single numeric variable, delivering descriptive statistics, tests for normality, and a suite of plots. An example call:

```
proc univariate data=work.customers normal;
    var age;
    histogram age / normal;
    inset mean std / position=ne;
    qqplot age / normal(mu=est sigma=est);
run;
```

The procedure outputs a histogram with a fitted normal curve and an inset showing the mean and standard deviation, followed by a Q‑Q plot. The NORMAL option on the PROC statement requests normality tests (e.g., Shapiro‑Wilk), which help decide whether transformations are needed before applying parametric models.

PROC FREQ – PROC FREQ creates frequency tables for categorical variables and can compute chi‑square tests, measures of association, and confidence intervals for proportions. Example:

```
proc freq data=work.survey;
    tables gender*purchase_intent / chisq measures;
run;
```

This generates a cross‑tabulation of gender by purchase intent, along with chi‑square statistics and Cramer’s V to assess the strength of association. The output can be directed to ODS HTML for viewing in a browser.

PROC CORR – PROC CORR calculates Pearson, Spearman, and Kendall correlation coefficients, and can produce a correlation matrix with associated p‑values. It also offers the OUTP= option to store the Pearson correlation matrix in a data set for downstream processing, and the NOPROB option to suppress the p‑values in the printed output. Example:

```
proc corr data=work.financials noprob outp=work.corrout;
    var revenue profit margin;
run;
```

The resulting data set corrout contains the correlation coefficients (the rows where _TYPE_ is 'CORR') together with means, standard deviations, and counts. After the coefficients are reshaped to one row per variable pair, they can be visualized using a heat map in PROC SGPLOT.

PROC MEANS – PROC MEANS computes descriptive statistics for numeric variables, optionally grouped by one or more class variables. It is frequently used for rapid aggregation. Example:

```
proc means data=work.sales n mean std min max;
    class region quarter;
    var sales_amount;
run;
```

The output shows the number of observations, mean, standard deviation, minimum, and maximum sales amount for each region‑quarter combination.

PROC SUMMARY – PROC SUMMARY is similar to PROC MEANS but is primarily used to create summarized data sets rather than printed output. The OUTPUT OUT= statement stores the aggregated results, which can then be merged with other data or visualized. Example:

```
proc summary data=work.transactions nway;
    class store month;
    var total_price;
    output out=work.monthly_sales sum=monthly_total;
run;
```

The resulting data set monthly_sales contains one row per store‑month with the total sales amount, ready for a time‑series line chart.

PROC TRANSPOSE – PROC TRANSPOSE reshapes data from long to wide format or vice versa. When pivoting sales data, the following code creates a column for each month:

```
proc transpose data=work.monthly_sales out=work.wide_sales;
    by store;
    id month;
    var monthly_total;
run;
```

The ID statement defines the new column names based on the values of month, while the VAR statement specifies the value to be placed in each cell.

PROC CLUSTER – PROC CLUSTER performs hierarchical clustering using methods such as Ward’s, single linkage, or complete linkage. The procedure produces a dendrogram that visualizes how observations merge at each distance threshold. Example:

```
proc cluster data=work.customers method=ward outtree=work.tree;
    var age income spend_score;
    id customer_id;
run;
```

The OUTTREE= data set stores the resulting tree, which can be cut at a desired number of clusters using PROC TREE to assign a cluster label to each observation.
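Cutting the tree might look like the following sketch, which uses the SASHELP.IRIS sample data set (assumed to be available) so that it runs on its own. The names work.tree and work.iris_clusters and the choice of three clusters are illustrative.

```sas
/* Build the hierarchy and save it in an OUTTREE= data set */
proc cluster data=sashelp.iris method=ward outtree=work.tree noprint;
    var sepallength sepalwidth petallength petalwidth;
run;

/* Cut the tree into three clusters; OUT= holds one row per observation */
proc tree data=work.tree nclusters=3 out=work.iris_clusters noprint;
run;

/* Check the cluster sizes */
proc freq data=work.iris_clusters;
    tables cluster;
run;
```

Because no ID statement is used here, PROC CLUSTER labels the observations OB1, OB2, and so on in the _NAME_ variable of the output.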

PROC FASTCLUS – PROC FASTCLUS implements the k‑means algorithm, efficiently handling large data sets. The user must specify the maximum number of clusters (MAXCLUSTERS=) and can optionally supply initial cluster seeds (SEED=). Example:

```
proc fastclus data=work.customers maxclusters=5 out=work.clustered;
    var age income spend_score;
run;
```

The output data set clustered contains a variable Cluster that indicates the assigned cluster for each customer.

PROC HPFOREST – PROC HPFOREST builds a random forest model using SAS high‑performance (HP) analytics. Random forests combine many decision trees to improve predictive accuracy and provide variable importance measures. Example:

```
proc hpforest data=work.training maxtrees=200;
    target churn / level=binary;
    input age income spend_score / level=interval;
    input gender / level=nominal;
    score out=work.rf_pred;
run;
```

The MAXTREES= option sets the number of trees, and the SCORE statement writes the predictions to a data set. The procedure outputs predicted probabilities and a ranking of variables based on their contribution to reducing impurity.

PROC SGPANEL – PROC SGPANEL creates panelled visualizations, where a separate subplot is drawn for each level of a grouping variable. This is useful for comparing distributions across categories. Example:

```
proc sgpanel data=work.sales;
    panelby region / layout=rowlattice;
    histogram revenue / nbins=15;
run;
```

Each region receives its own histogram, allowing analysts to spot regional differences in revenue distribution.

PROC SGSCATTER – PROC SGSCATTER produces a matrix of scatter plots for multiple pairs of variables. It is a quick way to explore pairwise relationships in a multivariate data set. Example:

```
proc sgscatter data=work.financials;
    matrix revenue profit margin / diagonal=(histogram);
run;
```

The diagonal displays histograms of each variable, while the off‑diagonal cells contain scatter plots for every variable pair.

PROC PRINT – PROC PRINT displays the contents of a data set in tabular form. While simple, it is frequently used during exploration to verify that data transformations have been applied correctly. Adding the OBS= data set option limits the number of rows shown, preventing overwhelming output. Example:

```
proc print data=work.sample_obs(obs=10);
    var customer_id age gender income;
run;
```

This prints the first ten observations, providing a quick sanity check.

PROC CONTENTS – PROC CONTENTS provides metadata about a SAS data set, including variable names, types, lengths, and labels. It is a valuable first step in exploration because it reveals hidden issues such as variables stored with an incorrect length or missing labels that can impede interpretation. Example:

```
proc contents data=work.sales varnum;
run;
```

The VARNUM option orders the output by the position of variables in the data set, mirroring the order seen in the data view.

PROC SQL – PROC SQL enables relational database queries directly within SAS, allowing analysts to join tables, filter rows, and compute aggregates using standard SQL syntax. For data exploration, PROC SQL can be employed to quickly create summary tables without writing multiple DATA step merges. Example:

```
proc sql;
    create table work.sales_summary as
    select region,
           sum(revenue) as total_rev,
           avg(discount) as avg_disc
    from work.sales
    group by region;
quit;
```

The resulting table can be visualized with a bar chart to compare total revenue across regions.

PROC TRANSREG – PROC TRANSREG performs regression‑based transformations, including Box‑Cox, spline, and monotonic transformations. This procedure is useful when a variable violates normality assumptions and a parametric transformation is needed. Example:

```
proc transreg data=work.sales;
    model boxcox(revenue) = identity(discount);
run;
```

The Box‑Cox transformation requires a strictly positive response, and the right‑hand side of the MODEL statement must name at least one predictor (discount here). The output suggests the optimal Box‑Cox lambda parameter, which can then be applied in a subsequent DATA step as (revenue**lambda - 1) / lambda, or as LOG(revenue) when lambda is 0.
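A minimal sketch of applying a chosen lambda with the standard Box-Cox formula in a DATA step is shown below. The macro variable value of 0.5 is a hypothetical placeholder, not an estimate, and SASHELP.CARS with its MSRP variable stands in for the sales data.

```sas
/* Hypothetical value; substitute the lambda reported by PROC TRANSREG */
%let lambda = 0.5;

data work.cars_bc;
    set sashelp.cars;
    /* Box-Cox: log when lambda is 0, otherwise (y**lambda - 1) / lambda */
    if &lambda = 0 then msrp_bc = log(msrp);
    else msrp_bc = (msrp**&lambda - 1) / &lambda;
run;

/* Compare skewness before and after the transformation */
proc means data=work.cars_bc n mean skewness;
    var msrp msrp_bc;
run;
```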

PROC HPF – PROC HPF (High‑Performance Forecasting, part of SAS Forecast Server) generates time‑series forecasts automatically. It fits exponential smoothing models and can select the best‑fitting candidate for each series. Example:

```
proc hpf data=work.monthly_sales outfor=work.forecast lead=12;
    by store;
    id month interval=month;
    forecast monthly_total / model=best;
run;
```

The ID variable must hold SAS date values, and the input must be sorted by the BY variable. The OUTFOR= data set includes point forecasts and prediction limits for the next twelve months, which can be plotted with a line chart to visualize the forecast trajectory.

PROC HPCLUS – PROC HPCLUS performs high‑performance k‑means clustering on large data sets and can run in parallel in a distributed computing environment. Example:

```
proc hpclus data=work.large_customers maxclusters=4;
    input age income spend_score;
    score out=work.hp_cluster;
run;
```

The procedure scales efficiently to millions of rows, making it suitable for big‑data segmentation projects.

PROC HPFOREST (regression) – In addition to classification, PROC HPFOREST can be used for regression by specifying a numeric target variable. The algorithm automatically handles missing values and provides out‑of‑bag error estimates, which serve as an internal validation metric.

PROC SGPLOT with GROUP= option – Adding the GROUP= option to a PROC SGPLOT statement enables color‑coding of points or bars based on a categorical variable.
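As a brief illustration, the sketch below draws clustered bars of mean list price by vehicle type, colored by region of origin. It assumes the SASHELP.CARS sample data set is available.

```sas
proc sgplot data=sashelp.cars;
    /* One cluster of bars per TYPE, one colored bar per ORIGIN within it */
    vbar type / response=msrp stat=mean
                group=origin groupdisplay=cluster;
run;
```

Without GROUPDISPLAY=CLUSTER the grouped bars are stacked, which is rarely meaningful for means.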

Key takeaways

  • Understanding the structure of a data set is the first step in any exploration because it determines which analytical techniques are appropriate.
  • A common challenge is to identify variables that have been mistakenly imported as character when they should be numeric, such as an order amount that contains only digits but is stored as text.
  • In practice, analysts frequently filter observations based on criteria such as date ranges, geographic regions, or customer segments to reduce the size of the data set and focus the exploration on relevant subsets.
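  • For example, if a variable has less than 5% missing and the missingness appears random, simple mean imputation may be acceptable; however, if the missingness is systematic (e.g., higher‑income respondents are more likely to skip a question), more sophisticated techniques such as multiple imputation or model‑based approaches are required.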
  • SAS provides a variety of functions—such as compress, substr, and input—that are useful for cleaning character variables, while numeric variables can be validated using logical expressions.
  • Identifying outliers often begins with visual tools such as boxplots, which display the interquartile range and flag points beyond the whiskers as potential outliers.
  • For example, PROC MEANS can be instructed to compute the mean, standard deviation, minimum, and maximum for every numeric variable in a data set, while PROC UNIVARIATE provides additional measures of skewness and kurtosis.