P-value calculator — Z, t, Chi-Square & F tests
Calculate the exact p-value from any Z, t, chi-square, or F test statistic. Choose one-tailed or two-tailed for Z and t tests, see significance at every standard α level, and compare against critical values, all in real time.
Example: a test statistic of 2.33 gives p = 0.0198. For a two-tailed Z-test, the result is statistically significant at α = 0.05: reject the null hypothesis.
Critical values
Your statistic vs standard thresholds
| Significance level | Critical value | Your stat | Reject H₀? |
|---|---|---|---|
| α = 0.1 | ±1.6449 | 2.33 | Yes ✓ |
| α = 0.05 | ±1.9600 | 2.33 | Yes ✓ |
| α = 0.01 | ±2.5758 | 2.33 | No |
| α = 0.001 | ±3.2905 | 2.33 | No |
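The thresholds in the table above can be reproduced with the inverse normal CDF in Python's standard library: the two-tailed critical value at level α is Φ⁻¹(1 − α/2). A minimal sketch (the function name is illustrative):

```python
from statistics import NormalDist

def z_critical(alpha: float) -> float:
    """Two-tailed critical value: reject H0 when |z| exceeds this."""
    return NormalDist().inv_cdf(1 - alpha / 2)

for alpha in (0.1, 0.05, 0.01, 0.001):
    print(f"alpha = {alpha}: +/-{z_critical(alpha):.4f}")
# prints 1.6449, 1.9600, 2.5758, 3.2905 — the column above
```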
Interpretation guide
p = 0.0198 — what does that mean?
p = 0.0198 means: if the null hypothesis (H₀) were true, the probability of observing a test statistic at least as extreme as 2.33 (in either direction) by chance alone is 1.98%.
Field guide
What the p-value is and what people get wrong about it.
What is a p-value?
A p-value is the probability of observing a test statistic as extreme as (or more extreme than) the one actually observed, assuming the null hypothesis (H₀) is true. It measures how surprising the data are under the null hypothesis — not how likely the null hypothesis is.
Formally: p = P(observing data this extreme | H₀ is true). A small p-value (conventionally p < 0.05) indicates that the observed data would be unlikely if H₀ were true, so we reject H₀. A large p-value means the data are not particularly surprising under H₀, so we fail to reject it.
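As a concrete check of that definition, the two-tailed Z p-value can be computed with nothing but Python's standard library. A minimal sketch (the function name is illustrative):

```python
from statistics import NormalDist

def z_p_value(z: float, two_tailed: bool = True) -> float:
    """p-value for a Z statistic under H0, where Z ~ N(0, 1)."""
    upper_tail = NormalDist().cdf(-abs(z))  # P(Z >= |z|) by symmetry
    return 2 * upper_tail if two_tailed else upper_tail

print(round(z_p_value(2.33), 4))  # → 0.0198, the example result above
```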
The most common misconceptions
The p-value is one of the most frequently misinterpreted statistics in all of science. Common wrong interpretations include:
- "p = 0.03 means there is a 3% probability that H₀ is true." Wrong. The p-value says nothing about the probability of H₀. H₀ is either true or false; it doesn't have a probability. The p-value is a conditional probability: P(data | H₀), not P(H₀ | data).
- "p = 0.03 means there is a 97% probability that H₁ is true." Also wrong, for the same reason. Determining P(H₁ | data) requires prior probabilities and Bayesian inference, a different framework entirely.
- "p > 0.05 means H₀ is true." Failing to reject H₀ is not the same as accepting it. The data may simply be insufficient to detect a real effect.
One-tailed vs two-tailed tests
The choice of tail affects the p-value and should be made before looking at the data, based on the research question:
- Two-tailed: H₁ is that the parameter differs from H₀ in either direction. p = P(|T| ≥ |t_obs|). Use this when you have no directional hypothesis. It is the more conservative and more common default.
- Right-tailed: H₁ is that the parameter is greater than H₀. p = P(T ≥ t_obs).
- Left-tailed: H₁ is that the parameter is less than H₀. p = P(T ≤ t_obs).
For a given test statistic, the one-tailed p-value is exactly half the two-tailed p-value (when the statistic is in the hypothesised direction). Switching from two-tailed to one-tailed to cross the 0.05 threshold after seeing the data is a form of p-hacking.
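The halving relationship holds whenever the null distribution is symmetric, as the standard normal is. It can be verified directly (a small sketch; the function name and the `tail` parameter are illustrative):

```python
from statistics import NormalDist

def z_p_value_tailed(z: float, tail: str = "two") -> float:
    """p-value for a Z statistic; tail is 'two', 'right', or 'left'."""
    cdf = NormalDist().cdf
    if tail == "right":
        return 1 - cdf(z)           # P(Z >= z)
    if tail == "left":
        return cdf(z)               # P(Z <= z)
    return 2 * (1 - cdf(abs(z)))    # P(|Z| >= |z|)

z = 2.33
# One-tailed p (in the hypothesised direction) is exactly half the two-tailed p:
assert abs(z_p_value_tailed(z, "two") - 2 * z_p_value_tailed(z, "right")) < 1e-12
```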
The four test families
Z-test
Used when the population standard deviation σ is known, or when the sample is large enough that the sample standard deviation is a reliable estimate (typically n ≥ 30). The test statistic follows a standard normal distribution N(0, 1) under H₀. Common applications: one-sample and two-sample proportion tests, large-sample means.
t-test
Used when σ is unknown and estimated from the sample, especially for small samples (n < 30). The test statistic follows a Student's t-distribution with (n − 1) degrees of freedom for a one-sample test, (n₁ + n₂ − 2) for an independent two-sample test, or (n − 1) for a paired t-test. The t-distribution has heavier tails than the normal, producing higher p-values for the same test statistic — a conservative correction for the extra uncertainty.
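Without SciPy, the t p-value can still be computed from the regularized incomplete beta function: for ν degrees of freedom, the two-tailed p-value is I_x(ν/2, 1/2) with x = ν/(ν + t²). A self-contained sketch using the standard continued-fraction evaluation (all function names are illustrative):

```python
import math

def _betacf(a: float, b: float, x: float, max_iter: int = 200, eps: float = 3e-12) -> float:
    """Continued fraction for the incomplete beta function (Lentz's method)."""
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c, d = 1.0, 1.0 - qab * x / qap
    if abs(d) < 1e-30:
        d = 1e-30
    d = 1.0 / d
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even step of the continued fraction
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 / max(abs(1.0 + aa * d), 1e-30) * math.copysign(1.0, 1.0 + aa * d)
        c = 1.0 + aa / c if abs(1.0 + aa / c) > 1e-30 else 1e-30
        h *= d * c
        # Odd step of the continued fraction
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 / max(abs(1.0 + aa * d), 1e-30) * math.copysign(1.0, 1.0 + aa * d)
        c = 1.0 + aa / c if abs(1.0 + aa / c) > 1e-30 else 1e-30
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h

def reg_inc_beta(a: float, b: float, x: float) -> float:
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    ln_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                + a * math.log(x) + b * math.log(1.0 - x))
    front = math.exp(ln_front)
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b

def t_p_value(t: float, df: int) -> float:
    """Two-tailed p-value for a t statistic with df degrees of freedom."""
    x = df / (df + t * t)
    return reg_inc_beta(df / 2.0, 0.5, x)

print(round(t_p_value(2.0, 10), 4))  # ≈ 0.0734
```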
Chi-Square (χ²) test
Used for categorical data. Common applications include goodness-of-fit tests (does observed frequency match a theoretical distribution?), tests of independence (are two categorical variables independent?), and tests of homogeneity. The χ² statistic is always non-negative; the p-value is always the upper-tail probability. Degrees of freedom depend on the specific test: for a goodness-of-fit test with k categories, df = k − 1; for a contingency table with r rows and c columns, df = (r−1)(c−1).
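For even degrees of freedom, the χ² upper-tail probability has a simple closed form, P(χ² ≥ x) = e^(−x/2) · Σᵢ₌₀^(k−1) (x/2)ⁱ / i! with k = df/2, which is enough to sanity-check a calculator. A sketch (the function name is illustrative; odd df requires the incomplete gamma function instead):

```python
import math

def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper-tail P(chi2 >= x) for a chi-square with EVEN df (closed form)."""
    assert df > 0 and df % 2 == 0, "closed form only holds for even df"
    k = df // 2
    term, total = 1.0, 1.0
    for i in range(1, k):       # partial sum of the Poisson-like series
        term *= (x / 2) / i
        total += term
    return math.exp(-x / 2) * total

# 9.488 is the familiar 5% critical value for df = 4:
print(round(chi2_sf_even_df(9.488, 4), 4))  # ≈ 0.05
```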
F-test
Used to compare variances or model fit. ANOVA uses the F-test to compare group means by taking the ratio of between-group variance to within-group variance. Regression uses it to test overall model significance. The F statistic is always non-negative; the p-value is the upper-tail probability. Two degrees-of-freedom parameters are required: df₁ (numerator) and df₂ (denominator), which in ANOVA correspond to (k−1) and (N−k) respectively.
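The definition of F as a ratio of scaled independent χ² variables can be checked by simulation, with no distribution tables at all. A Monte Carlo sketch (names and example numbers are illustrative; F ≈ 3.10 with df₁ = 3, df₂ = 20 sits near the 5% critical value):

```python
import random

def f_sf_mc(f_obs: float, df1: int, df2: int,
            n: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the upper-tail P(F >= f_obs)."""
    rng = random.Random(seed)

    def chi2(df: int) -> float:
        # A chi-square variate is a sum of df squared standard normals.
        return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))

    hits = 0
    for _ in range(n):
        f = (chi2(df1) / df1) / (chi2(df2) / df2)  # definition of F
        if f >= f_obs:
            hits += 1
    return hits / n

# Should land close to 0.05, since F(0.05; 3, 20) ≈ 3.10:
print(f_sf_mc(3.10, df1=3, df2=20))
```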
The 0.05 threshold — why it is and isn't magic
The α = 0.05 threshold was popularised by R.A. Fisher in the 1920s as a convenient rule of thumb, not a fundamental truth. Over time it has become a de facto publishing gate in many fields, causing significant problems:
- Studies with p = 0.049 and p = 0.051 are treated radically differently despite being statistically indistinguishable.
- Publication bias toward p < 0.05 results inflates the literature with false positives.
- Effect size and confidence intervals are often more informative than the binary significant/not-significant classification.
Modern statistical practice increasingly recommends reporting exact p-values, effect sizes, and confidence intervals rather than binary significance labels.